Hortonworks Ted Yu: HBase 1.0 and 2.0

By Steven Parker, 2015-08-24 03:34

    Hortonworks Ted Yu: the latest on HBase 1.0 and 2.0


    From December 12 to 14, 2014, the 2014 China Big Data Technology Conference (BDTC 2014) and the second CCF Big Data Conference opened at the Crowne Plaza Hotel (New Yunnan) in Beijing. Sponsored by the China Computer Federation (CCF), organized by the CCF Task Force on Big Data, and co-organized by the Institute of Computing Technology of the Chinese Academy of Sciences and CSDN, the conference aims to promote scientific research, applications, and industry development around big data.

    On the 13th, Hortonworks senior engineer and Apache HBase core contributor Ted Yu gave a talk titled "Recent Developments in Apache HBase," introducing the latest progress on HBase 1.0 and 2.0, mainly covering HBase 1.0 itself, HydraBase, Phoenix secondary indexes, and per-column-family flush. According to him, the major changes in HBase 1.0 include improvements in stability, availability, and usability, such as embedding a RegionServer in the Master. HydraBase, which takes a different approach, provides 99.99% or higher availability, recovering within seconds when a cluster goes down, without losing data.

    Hortonworks senior engineer and Apache HBase core contributor Ted Yu


    Ted Yu explained that HBase 1.0 differs from 0.98 in many ways, reflected in stability, availability, usability, online config changes, and more. Specifically:

    Stability: co-locate hbase:meta with the Master

    - Simplifies region assignment and improves its reliability: fewer components are involved
    - The Master embeds a RegionServer that serves only system tables
    - Backup Masters can be configured to serve user tables; the code has passed preliminary tests

    Availability: Region Replicas

    - Multiple RegionServers host the same region
    - A baby step toward quorum reads and writes

    Ted Yu explained that a strength of HBase is strong consistency, but with each region served by a single RegionServer it is hard to guarantee high availability: a software bug, or an OS or hardware failure, can leave a machine unresponsive. One goal of HBASE-10070 is high availability: among the region replicas only one acts as the primary RegionServer, and the rest are replicas. A short delay is configured; if the primary has not responded within that delay, the request is also sent to the replicas. The target is to bound read response time, preserving consistency and availability at the same time.
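The primary-then-replicas read path described above can be sketched roughly as follows. This is an illustrative simulation, not the actual HBase client implementation; the class and method names are made up:

```java
import java.util.List;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Toy model of a region-replica read: the read goes to the primary first;
// if the primary has not answered within a short configured delay, the same
// read is also sent to the secondary replicas and the first answer wins.
public class TimelineRead {
    public static String read(Supplier<String> primary,
                              List<Supplier<String>> replicas,
                              long primaryTimeoutMillis) throws Exception {
        ExecutorService pool = Executors.newCachedThreadPool();
        try {
            CompletionService<String> cs = new ExecutorCompletionService<>(pool);
            cs.submit(primary::get);                 // primary gets a head start
            Future<String> first = cs.poll(primaryTimeoutMillis, TimeUnit.MILLISECONDS);
            if (first != null) {
                return first.get();                  // primary answered in time
            }
            for (Supplier<String> replica : replicas) {
                cs.submit(replica::get);             // fan out to the replicas
            }
            return cs.take().get();                  // first answer wins
        } finally {
            pool.shutdownNow();
        }
    }
}
```

A slow primary thus adds at most the configured delay to the read, after which a replica (possibly serving slightly stale data) can answer instead.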

    Usability: client API changes

    - Improved self-consistency
    - Simplified semantics
    - @InterfaceAudience annotations

    Among them, a sample of the new client API is as follows:

    Connection conn = ConnectionFactory.createConnection(conf);
    try {
        UserProvider userProvider = UserProvider.instantiate(conf);
        TokenUtil.addTokenForJob(conn, userProvider.getCurrent(), job);
    } finally {
        conn.close();
    }

    Online config changes: ported from the 89-fb branch (HBASE-12147)

    - The sizes of the global MemStore and BlockCache are adjusted automatically
    - BucketCache is easier to configure
    - Pluggable replication endpoint
    - Greatly expanded
    - Combined MVCC/sequence ID
    - Various improvements involving security, tags, and labels
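The first bullet, automatically sizing the global MemStore and BlockCache, amounts to shifting heap between the two sides based on observed pressure. A minimal sketch of that idea follows; the field names, step size, and bounds are illustrative assumptions, not HBase's actual tuner:

```java
// Toy model of automatic MemStore/BlockCache sizing: heap fractions are
// nudged toward whichever side is under pressure, within fixed bounds.
public class HeapTuner {
    static final double STEP = 0.02, MIN = 0.10, MAX = 0.70;
    double memstoreFraction = 0.40;
    double blockCacheFraction = 0.40;

    // blockedFlushes signal MemStore pressure; evictions signal cache pressure
    void tune(long blockedFlushes, long cacheEvictions) {
        if (blockedFlushes > cacheEvictions
                && memstoreFraction + STEP <= MAX && blockCacheFraction - STEP >= MIN) {
            memstoreFraction += STEP;       // give the write path more heap
            blockCacheFraction -= STEP;
        } else if (cacheEvictions > blockedFlushes
                && blockCacheFraction + STEP <= MAX && memstoreFraction - STEP >= MIN) {
            blockCacheFraction += STEP;     // give the read cache more heap
            memstoreFraction -= STEP;
        }
    }
}
```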


    HydraBase's goal is 99.99% or higher availability, specifically:

    - A cluster-level failure must not cause data loss
    - All failures should be recovered from quickly (within seconds)
    - Distributed consensus should not hurt write throughput
    - Each region is served by a set of RegionServers

    These goals are being implemented step by step, mainly through the replication protocol and the RMAP.

    Replication Protocol

    - Among a set of replicas there is exactly one leader
    - The leader serves all client read and write requests
    - Leader election uses the RAFT consensus protocol
    - Each replica keeps its own write-ahead log, stored locally
    - The leader replicates write operations to the other replicas
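The replication rules above can be sketched as a toy model: the leader appends each write to its own WAL, ships the entry to the replicas, and treats the write as committed once a majority of replicas holds it. This is an illustrative simulation, not HydraBase code:

```java
import java.util.ArrayList;
import java.util.List;

// Toy quorum-replicated write: replica 0 is the leader; a write is
// committed only when a majority of all replicas has appended it.
public class QuorumWrite {
    final List<List<String>> walPerReplica = new ArrayList<>(); // index 0 = leader

    public QuorumWrite(int replicaCount) {
        for (int i = 0; i < replicaCount; i++) {
            walPerReplica.add(new ArrayList<>());   // each replica's local WAL
        }
    }

    // reachable[i] says whether replica i accepted the append
    public boolean write(String entry, boolean[] reachable) {
        int acks = 0;
        for (int i = 0; i < walPerReplica.size(); i++) {
            if (i == 0 || reachable[i]) {           // leader always appends locally
                walPerReplica.get(i).add(entry);
                acks++;
            }
        }
        return acks > walPerReplica.size() / 2;     // committed iff a quorum has it
    }
}
```

With three replicas, a write commits as long as the leader plus one replica append it, which is what lets a single machine failure pass unnoticed by clients.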


    - The RMAP contains the quorum configuration for each region
    - Each data center (DC) is ranked by its network latency to the client
    - The DC with the lowest latency gets the highest rank
    - A quorum member in a higher-ranked DC is eligible to take over the leadership
    - The replica with the highest combined rank (DC-rank + machine-rank) is the most likely to become leader

    Ted Yu also introduced HydraBase's single-cluster and multi-cluster deployment setups.
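The leader-preference rule above (among quorum members, the highest combined DC-rank + machine-rank wins) can be sketched as follows; the class, field names, and rank values are hypothetical, not HydraBase's actual code:

```java
import java.util.Comparator;
import java.util.List;

// Toy model of HydraBase-style leader preference: only quorum members are
// eligible, and the highest combined DC-rank + machine-rank is preferred.
public class LeaderPreference {
    public static class Replica {
        final String name;
        final int dcRank;        // data-center rank (lowest latency = highest)
        final int machineRank;   // machine rank within its data center
        final boolean quorumMember;

        public Replica(String name, int dcRank, int machineRank, boolean quorumMember) {
            this.name = name;
            this.dcRank = dcRank;
            this.machineRank = machineRank;
            this.quorumMember = quorumMember;
        }
    }

    public static String preferredLeader(List<Replica> replicas) {
        return replicas.stream()
                .filter(r -> r.quorumMember)  // only quorum members are eligible
                .max(Comparator.comparingInt(r -> r.dcRank + r.machineRank))
                .orElseThrow(IllegalArgumentException::new)
                .name;
    }
}
```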

    Phoenix secondary indexes

    On secondary indexes, which many people care about, Ted Yu said that users' requirements are quite complex, and meeting all of them will still take considerable effort. He presented Phoenix's index structure as follows:
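While the slide itself is not reproduced here, the core idea of a Phoenix-style global secondary index can be sketched: the index is a second sorted table whose row key is the indexed column value followed by the data row key, so a lookup by value becomes a prefix scan over the index rather than a full scan of the data table. This is an illustrative simulation, not Phoenix code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model of a global secondary index: TreeMap stands in for a sorted
// HBase table; the index is maintained on every write to the data table.
public class SecondaryIndexSketch {
    final TreeMap<String, String> dataTable = new TreeMap<>();   // rowKey -> value
    final TreeMap<String, String> indexTable = new TreeMap<>();  // value\0rowKey -> rowKey

    void put(String rowKey, String value) {
        dataTable.put(rowKey, value);
        indexTable.put(value + "\0" + rowKey, rowKey);  // index row key = value + rowKey
    }

    List<String> rowsWithValue(String value) {
        List<String> rows = new ArrayList<>();
        // prefix scan over the index: all keys starting with value + '\0'
        for (Map.Entry<String, String> e :
                indexTable.subMap(value + "\0", value + "\0\uffff").entrySet()) {
            rows.add(e.getValue());
        }
        return rows;
    }
}
```

Keeping the index in step with the data table on every write is exactly the hard part Phoenix has to solve; this sketch only shows the key layout that makes the lookup cheap.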

    Per-column-family flush

    Progress on per-column-family flush is as follows:

    - HBASE-10201, from the 0.89-fb branch
    - Reduces write amplification by 10%
    - A lower bound on the per-column-family flush size
    - A FlushPolicy controls whether all stores are flushed
    - Per-Store sequence IDs

    One data-model problem is that with many column families, the bulk of the data often lands in just one or two of them, while the other families stay relatively cold. So far, all column families have flushed together, which makes consistency easier but creates a problem: each flush produces small files for the colder families, putting I/O write-amplification pressure on the system, so the aim is to reduce write amplification. HBASE-10201 introduces a designated FlushPolicy: only column families above the lower bound are flushed, which in practice corresponds to flushing only the larger column families.
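The FlushPolicy idea above can be sketched as follows; this is an illustrative model, not the actual HBASE-10201 code, and the threshold value and flush-everything fallback are assumptions:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Toy per-column-family flush policy: only families whose memstore exceeds
// a lower bound are flushed, so cold families stop producing tiny files.
public class PerFamilyFlushPolicy {
    final long lowerBoundBytes;

    PerFamilyFlushPolicy(long lowerBoundBytes) {
        this.lowerBoundBytes = lowerBoundBytes;
    }

    // memstoreSizes: column family name -> bytes currently in its memstore
    List<String> familiesToFlush(Map<String, Long> memstoreSizes) {
        List<String> selected = new ArrayList<>();
        for (Map.Entry<String, Long> e : memstoreSizes.entrySet()) {
            if (e.getValue() >= lowerBoundBytes) {
                selected.add(e.getKey());       // big enough to be worth flushing
            }
        }
        // if nothing crosses the bound, fall back to flushing everything
        return selected.isEmpty() ? new ArrayList<>(memstoreSizes.keySet()) : selected;
    }
}
```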

    In test runs of HBASE-10201, per-column-family flush reduced I/O: the first two flushes use the existing behavior of flushing all column families (red is I/O), while the last two flush per column family, with lower I/O write amplification than the existing approach.
