HBase 快速指南

HBase 作为大数据平台中非常知名的 NoSQL 数据库之一,经过长时间的发展已经非常强大和稳定。强大的同时也意味着复杂,光把官方的入门指南看完都已经是非常不容易的事情,本文力图从实践的角度出发,让大家能够快速把 HBase『用』起来。


更新历史

  • 2017.03.13: 完成初稿

写在前面

HBase 是一个开源的、分布式的、支持数据多版本的、面向列族存储的 NoSQL 数据库(基于 Google 的 Bigtable 论文),依托 Hadoop 的分布式文件系统 HDFS 作为底层存储,能够为数十亿行、数百万列的海量数据表提供随机、实时的读写访问。

因为现在身处创业公司,从运维到开发都要自己来,针对 DevOps 中的各种奇怪问题,需要有一份简单明了、方便查阅的快速指南。而且随着 HBase 的版本迭代,老的中文文档已经跟不上时代,需要基于新版本做一定的更新。本指南所测试的平台及软件配置如下:

  • HBase 版本: 1.2.1
  • Hadoop 版本: 2.5.1
  • ZooKeeper 客户端版本: 3.4.6
  • 集群规模: 3 台云服务器

注:安装配置见参考链接中的官方文档,这里不赘述

日常运维

日常运维需要知道的内容如下,熟悉之后即可应对大部分日常操作。

  • 检查 HDFS 状况 hdfs dfsadmin -report
  • 启动/停止 HBase 的 Master 节点 $bin/hbase-daemon.sh [start|stop] master([start|stop] 表示这两个词任选其一,对应不同操作,后同)
  • 启动/停止 RegionServer $bin/hbase-daemon.sh [start|stop] regionserver
  • 启动 HBase 的 Rest 服务 $bin/hbase-daemon.sh start rest -p 16080(端口号随意,默认为 8080,可以通过 netstat -tln 查看在使用的端口号)
    • $bin/hbase rest start -p 16080 前台启动
  • 停止 HBase 的 Rest 服务 $bin/hbase-daemon.sh stop rest
    • $bin/hbase rest stop
  • 进入 HBase Shell hbase shell
  • HBase Web UI localhost:16010(如果有暴露公网 IP,在公网也可以访问)
  • HBase Region Server UI localhost:16030

HBase Shell 常用命令

  • 如果在 hbase shell 中输入 status 后发现有 dead 的机器,需要找到对应节点并重新启动(也可以用下方的 Java 示例从程序里检查)
  • 使用 help 了解命令详情
  • 使用脚本的方式 hbase shell filename
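下面是一段基于 HBase 1.2 Java 客户端的示意代码(假设 classpath 中能找到集群的 hbase-site.xml),用程序的方式检查集群状态和 dead server,效果类似 shell 中的 status 命令,仅作参考:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.ClusterStatus;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.ServerName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class ClusterStatusCheck {
    public static void main(String[] args) throws Exception {
        // 假设 hbase-site.xml 在 classpath 中,否则需手动设置 hbase.zookeeper.quorum
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            ClusterStatus status = admin.getClusterStatus();
            System.out.println("live servers: " + status.getServersSize());
            // 对应 shell 中 status 显示的 dead 机器,发现后需要去对应节点重启 RegionServer
            for (ServerName dead : status.getDeadServerNames()) {
                System.out.println("dead server: " + dead);
            }
        }
    }
}
```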

更多命令可参考文末参考链接中的官方文档。

RESTful API 使用

在 Header 中设置 Accept 可以选择返回格式,比如 Accept: text/xml 返回 XML,Accept: application/json 返回 JSON(Java 调用示例见下文)。

集群

  • [GET] /version/cluster 查询 HBase 版本信息
  • [GET] /status/cluster 查询集群状态
  • [GET] / 获取所有用户表的表名
  • [GET] /table_name/rowkey 获取对应的条目(BASE64 编码)
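下面给出一个用标准 Java HttpURLConnection 调用上述 REST 接口的示意(假设 REST 服务已按前文方式启动在本机 16080 端口):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class RestVersionQuery {
    public static void main(String[] args) throws Exception {
        // 假设 REST 服务运行在 localhost:16080,这里查询 HBase 版本信息
        URL url = new URL("http://localhost:16080/version/cluster");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("Accept", "application/json"); // 换成 text/xml 则返回 XML
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        } finally {
            conn.disconnect();
        }
    }
}
```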

更多 API 用法可参考文末参考链接中的官方文档。

Schema 设计

Rowkey 和 Schema 设计的好坏,很多时候是有决定性意义的。

基本概念

  • 每张表只有一个索引,这个索引就是 rowkey,并没有二级索引
  • 行按照 rowkey 排序,rowkey 按照字典序排序
  • 行级别的操作都是原子的(更新两行可能出现一行成功一行失败)
  • 读写应该均匀分布到整个表
  • 通常,应该把单个实体的所有信息保存在单行中
  • 相关的实体应该存储在临近的行(读写更高效,压缩数据更有效)
  • HBase 表是稀疏的。空的列不会占据任何空间
  • 确定 rowkey,最有效的查询就是使用 rowkey 获取特定行,或者多行数据,其他类型查询会触发全表扫描,效率比较低下(具体设计应该基于需求)
  • 让 rowkey 尽可能短,长的 rowkey 会占据更多空间,也会增加 HBase 的响应时间
  • 反转域名。如果你要存储的数据可表示为域名的形式,可以考虑反转域名作为 rowkey(例如:com.example.products)。反转域名是很不错的方法,尤其是相邻域名相似度很大时,同时这种方法也便于更有效地压缩数据。
  • 字符串标识符。如果你要存储的数据可以用一个字符串简单地标识(例如:user ID),可以使用该标识符作为 rowkey 或 rowkey 的一部分。考虑使用标识符的哈希而不是它本身,以便标识符的长度固定可预测。
  • 时间戳。如果你经常需要基于时间读取数据,将时间戳作为 rowkey 的一部分是不错的方法。不推荐直接使用时间戳本身作为 rowkey,因为会造成热点问题;同样的原因,也不应该把时间戳放在 rowkey 的开头。例如:你的应用可能需要记录性能相关的数据,比如每秒的 CPU、内存使用情况,这时可以把数据类型的标识符和时间戳组合为 rowkey(例如:cpu#1425330757685)。如果你需要经常获取最新的记录,可以使用反转时间戳作为 rowkey 的一部分,即用编程语言中长整型的最大值(Java 中是 java.lang.Long.MAX_VALUE)减去当前时间戳。有了反转时间戳,记录将会按照从新到旧的顺序排列(构造方式见本节列表后的 Java 示例)。
  • 避免这样的 rowkey:域名、连续的数字(应该反转用户数字 ID)、频繁更新的标识符(应该先将数据保存在内存中,再周期性批量写入 HBase)
  • 高表与宽表的不同
  • column family 尽量少,多了会互相影响
  • 对于列相对固定的应用,最好将一行记录封装到一个 column 中,可以节省存储空间,封装方式推荐 Protocol Buffers
  • column 数目不要超过百万
  • 利用好 scan(startkey, endkey) 这种方式
  • 设计原则 Denormalization, Duplication 和 Intelligent Keys
  • rowkey 和 column name 都要尽可能短
  • 考虑用字节来表示编号
  • 可以设置 column family 的存活时间 TTL
  • rowkey 可以做 hash 或编号,把变长变成定长,原始值另外放到 column 中即可,这样不影响取值,主要是方便 scan
  • 新版本支持 reverse scan,所以对于 schema 的设计不用特意倒过来
  • Rowkeys cannot be changed. The only way they can be “changed” in a table is if the row is deleted and then re-inserted. This is a fairly common question on the HBase dist-list so it pays to get the rowkeys right the first time (and/or before you’ve inserted a lot of data).
  • Keep your column family names as short as possible. The column family names are stored for every value (ignoring prefix encoding). They should not be self-documenting and descriptive like in a typical RDBMS.
  • If you are storing time-based machine data or logging information, and the row key is based on device ID or service ID plus time, you can end up with a pattern where older data regions never have additional writes beyond a certain age. In this type of situation, you end up with a small number of active regions and a large number of older regions which have no new writes. For these situations, you can tolerate a larger number of regions because your resource consumption is driven by the active regions only.
  • If only one column family is busy with writes, only that column family accumulates memory. Be aware of write patterns when allocating resources.
  • Try to make do with one column family if you can in your schemas. Only introduce a second and third column family in the case where data access is usually column scoped; i.e. you query one column family or the other but usually not both at the one time.
  • To prevent hotspotting on writes, design your row keys such that rows that truly do need to be in the same region are, but in the bigger picture, data is being written to multiple regions across the cluster, rather than one at a time. Some common techniques for avoiding hotspotting are described below, along with some of their advantages and drawbacks.
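针对上面『数据类型标识符 + 反转时间戳』的 rowkey 设计,下面是一个示意性的 Java 片段(表中列族名 d、指标名 cpu 等均为假设),展示如何构造这样的 rowkey:

```java
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class ReverseTimestampKey {
    // rowkey = 指标类型 + "#" + (Long.MAX_VALUE - 时间戳),让最新记录排在最前面
    static byte[] buildRowKey(String metric, long timestampMillis) {
        long reverseTs = Long.MAX_VALUE - timestampMillis;
        return Bytes.toBytes(metric + "#" + reverseTs);
    }

    public static void main(String[] args) {
        byte[] rowKey = buildRowKey("cpu", System.currentTimeMillis());
        // 随后即可用该 rowkey 构造 Put 写入(列族、列名为假设)
        Put put = new Put(rowKey);
        put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("value"), Bytes.toBytes("0.73"));
        System.out.println(Bytes.toString(rowKey));
    }
}
```

注意反转时间戳只是让『最新在前』的 scan 更方便,并不能解决热点问题,仍需配合类型前缀或哈希来打散写入。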

技巧

  • Salting. Salting attempts to increase throughput on writes, but has a cost during reads.
  • Hashing. Instead of a random assignment, you could use a one-way hash that would cause a given row to always be “salted” with the same prefix, in a way that would spread the load across the RegionServers, but allow for predictability during reads. Using a deterministic hash allows the client to reconstruct the complete rowkey and use a Get operation to retrieve that row as normal.(确定性哈希前缀的构造见本节列表后的 Java 示例)
  • Reversing the Key. A third common trick for preventing hotspotting is to reverse a fixed-width or numeric row key so that the part that changes the most often (the least significant digit) is first. This effectively randomizes row keys, but sacrifices row ordering properties.
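下面是 Hashing 思路的一个简单示意(以 MD5 为例,user ID 为假设):同一个 ID 永远得到相同的定长前缀,写入可以分散到不同 RegionServer,读取时客户端也能自行重建完整 rowkey 再用 Get 取回。

```java
import java.security.MessageDigest;
import org.apache.hadoop.hbase.util.Bytes;

public class HashedRowKey {
    // 前缀 = MD5(userId),定长 16 字节;后面拼上原始 ID,便于调试和重建 rowkey
    static byte[] buildRowKey(String userId) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        byte[] prefix = md5.digest(Bytes.toBytes(userId));
        return Bytes.add(prefix, Bytes.toBytes(userId));
    }

    public static void main(String[] args) throws Exception {
        byte[] key = buildRowKey("user1234");
        System.out.println("rowkey length = " + key.length); // 16 字节前缀 + ID 长度
    }
}
```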

关于 Secondary Index

This section could also be titled “what if my table rowkey looks like this but I also want to query my table like that.” A common example on the dist-list is where a row-key is of the format “user-timestamp” but there are reporting requirements on activity across users for certain time ranges. Thus, selecting by user is easy because it is in the lead position of the key, but time is not.

  • 使用 Query Filter,过滤掉部分数据
  • Periodic-Update Secondary Index, A secondary index could be created in another table which is periodically updated via a MapReduce job. The job could be executed intra-day, but depending on load-strategy it could still potentially be out of sync with the main data table.
  • Dual-Write Secondary Index, Another strategy is to build the secondary index while publishing data to the cluster (e.g., write to data table, write to index table). If this approach is taken after a data table already exists, then bootstrapping will be needed for the secondary index with a MapReduce job (see secondary.indexes.periodic).(双写示意代码见本节列表后)
  • Summary Tables, Where time-ranges are very wide (e.g., year-long report) and where the data is voluminous, summary tables are a common approach. These would be generated with MapReduce jobs into another table.
  • Coprocessor Secondary Index, Coprocessors act like RDBMS triggers. These were added in 0.92. For more information, see coprocessors
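Dual-Write 的思路可以用下面这段示意代码表达(表名 activity、activity_idx 与列族名均为假设),写主表的同时写一张以时间开头的索引表:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class DualWriteIndex {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        String user = "user42";
        long ts = System.currentTimeMillis();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table data = conn.getTable(TableName.valueOf("activity"));        // 假设的主表
             Table index = conn.getTable(TableName.valueOf("activity_idx"))) { // 假设的索引表
            // 主表 rowkey: user-timestamp,方便按用户查询
            Put p1 = new Put(Bytes.toBytes(user + "-" + ts));
            p1.addColumn(Bytes.toBytes("d"), Bytes.toBytes("action"), Bytes.toBytes("login"));
            data.put(p1);
            // 索引表 rowkey: timestamp-user,方便按时间范围查询
            Put p2 = new Put(Bytes.toBytes(ts + "-" + user));
            p2.addColumn(Bytes.toBytes("d"), Bytes.toBytes("ref"), Bytes.toBytes(user + "-" + ts));
            index.put(p2);
        }
    }
}
```

注意这两次 put 并不是原子的,索引表可能短暂落后于主表,这也是 Dual-Write 方案需要接受的代价。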

How I Learned To Stop Worrying And Love Denormalization

  • Schema Design 实际上就是 Logical Data Modeling + Physical Implementation
  • Relational DBs work well because they are close to the pure logical model
  • Any byte array in HBase can represent more than one logical attribute
  • The basic concept of denormalization is simple: two logical entities share one physical representation
  • This is a fundamental modeling property of HBase: nesting entities. 即在 rowkey 对应的一行中,某个 column 的名字可以是另一个实体的 key,对应的值就是那个 entity 的数据

  • The row key design is the single most important decision you will make. This is also true for the “key” you’re putting in the column family name of nested entities.
  • Design for the questions not the answers.
  • Bu isn’t NoSQL more flexible than a relational DB? For column schema? YES! For row key structures, NO!
  • Be compact. You can squeeze a lot into a little space
  • Use row atomicity as a design tool. Rows are updated atomically, which gives you a form of relational integrity in HBase
  • Attributes can move into the row key. Even if it’s not “identifying” (part of the uniqueness of an entity), adding an attribute into the row key can make access more efficient in some cases.
  • If you nest entities, you can transactionally pre-aggregate data. You can recalculate aggregates on write, or periodically with a map/reduce job.

HBase Schema Design and Cluster Sizing Notes

ApacheCon Europe, November 2012, by Lars George, Director EMEA Services @Cloudera, Apache Committer, lars@cloudera.com, @larsgeorge(twitter)

HBase Architecture

HBase Tables 每行由 Row Keys 索引,每列名称为 Column Names(Column Qualifiers, Column Keys),由 Row Key 和 Column Name 可以确定一个 cell,每个 cell 包含不同的数据版本。而不同的 row 可能在不同的 region 中,不同的 column 可能在不同的 column family 中。
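用 HBase 1.2 Java 客户端读取一个 cell 的多个版本,大致是下面这样(表名、rowkey、列族、列名均为假设):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class CellVersionsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("demo"))) { // 假设的表名
            Get get = new Get(Bytes.toBytes("row-1"));
            get.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col")); // Row Key + Column Name 定位 cell
            get.setMaxVersions(3);                                    // 同一个 cell 取最近 3 个版本
            Result result = table.get(get);
            for (Cell cell : result.rawCells()) {
                System.out.println(cell.getTimestamp() + " -> "
                        + Bytes.toString(CellUtil.cloneValue(cell)));
            }
        }
    }
}
```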

  • Table is made up of any number of regions
  • Region is specified by its startKey and endKey
  • Each region may live on a different node and is made up of several HDFS files and blocks, each of which is replicated by Hadoop
  • Tables are sorted by Row in lexicographical order
  • Table schema only defines its column families
    • Each family consists of any number of columns
    • Each column consists of any number of versions
    • Columns only exist when inserted, NULLs are free
    • Columns within a family are sorted and stored together
    • Everything except table names are byte[]
  • HBase uses HDFS(or similar) as its reliable storage layer
    • Handles checksums, replication, failover
  • Native Java API, Gateway for REST, Thrift, Avro
  • Master manages cluster
  • RegionServer manage data
  • ZooKeeper is used as the “neural network”
    • Crucial for HBase
    • Bootstraps and coordinates cluster
  • Based on Log-Structured Merge-Trees (LSM-Trees)
  • Inserts are done in write-ahead log first
  • Data is stored in memory and flushed to disk on regular intervals or based on size
  • Small flushes are merged in the background to keep number of files small
  • Reads read memory stores first and then disk-based files second
  • Deletes are handled with “tombstone” markers
  • Atomicity on row level no matter how many columns
    • keeps locking model easy

MemStores

  • After data is written to the WAL the RegionServer saves KeyValues in memory store
  • Flush to disk based on size, see hbase.hregion.memstore.flush.size
  • Default size was 64MB in older versions, 128MB in HBase 1.x
  • Uses snapshot mechanism to write flush to disk while still serving from it and accepting new data at the same time
  • Snapshots are released when flush has succeeded

Compactions

  • General Concepts
    • Two types: Minor and Major Compactions
    • Asynchronous and transparent to client
    • Manage file bloat from MemStore flushes
  • Minor Compactions
    • Combine last “few” flushes
    • Triggered by number of storage files
  • Major Compactions
    • Rewrite all storage files
    • Drop deleted data and those values exceeding TTL and/or number of versions
    • Triggered by time threshold
    • Cannot be scheduled to start automatically at a specific time (bummer!)
    • May (most definitely) tax overall HDFS IO performance
  • Tip: Disable major compactions and schedule them to run manually (e.g. via cron) at off-peak times(触发方式见下方 Java 示例)
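手动触发 major compaction 可以用 hbase shell 的 major_compact 命令,也可以像下面这样在定时任务中用 Java Admin API 触发(表名通过参数传入,仅为示意):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class ManualMajorCompact {
    public static void main(String[] args) throws Exception {
        // 假设由 cron 在低峰期调用,args[0] 为需要合并的表名
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            admin.majorCompact(TableName.valueOf(args[0])); // 异步触发,真正的合并在后台进行
        }
    }
}
```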

Block Cache

  • Acts as very large, in-memory distributed cache
  • Assigned a large part of JVM heap in the RegionServer process, see hfile.block.cache.size
  • Optimized reads on subsequent columns and rows
  • Has priority to keep “in-memory” column families in cache
  • Cache needs to be used properly to get best read performance
    • Turn off block cache on operations that cause large churn
    • Store related data “close” to each other
  • Uses LRU cache with threaded (asynchronous) evictions based on priorities

Region Splits

  • Triggered by configured maximum file size of any store file
    • This is checked directly after the compaction call to ensure store files are actually approaching the threshold.
  • Runs as asynchronous thread on RegionServer
  • Splits are fast and nearly instant
    • Reference files point to original region files and represent each half of the split
  • Compactions take care of splitting original files into new region directories

Auto Sharding and Distribution

  • Unit of scalability in HBase is the Region
  • Sorted, contiguous range of rows
  • Spread “randomly” across RegionServer
  • Moved around for load balancing and failover
  • Split automatically or manually to scale with growing data
  • Capacity is solely a factor of cluster nodes vs. regions per node

Column Family vs. Column

  • Use only a few column families
    • Causes many files that need to stay open per region plus class overhead per family
  • Best used when logical separation between data and meta columns
  • Sorting per family can be used to convey application logic or access pattern

Storage Separation

  • Column Families allow for separation of data
    • Used by Columnar Databases for fast analytical queries, but on column level only
    • Allows different (or no) compression depending on the content type(按 family 设置压缩、TTL 等属性的建表示例见本节之后)
  • Segregate information based on access pattern
  • Data is stored in one or more storage files, called HFiles
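结合前面关于 column family 数量、压缩、TTL、版本数的建议,建表时大致可以这样设置(表名 metrics、列族名 d 以及各参数值都是假设):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.io.compress.Compression;

public class CreateTableExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("metrics")); // 假设的表名
            HColumnDescriptor cf = new HColumnDescriptor("d"); // 尽量只用一个、名字尽量短的 column family
            cf.setCompressionType(Compression.Algorithm.GZ);   // 按内容类型选择压缩算法
            cf.setTimeToLive(30 * 24 * 3600);                  // TTL:保留 30 天(示例值)
            cf.setMaxVersions(3);                              // 保留 3 个版本(示例值)
            desc.addFamily(cf);
            admin.createTable(desc);
        }
    }
}
```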

Schema Design

Key Cardinality(基数)

  • The best performance is gained from using row keys
  • Time range bound reads can skip store files
    • So can Bloom Filter
  • Selecting column families reduces the amount of data to be scanned
  • Pure value based filtering is a full table scan
    • Filters often are too, but reduce network traffic

Fold, Store, and Shift

  • Logical layout does not match physical one
  • All values are stored with the full coordinates, including: Row Key, Column Family, Column Qualifier, and Timestamp
  • Folds columns into “row per column”
  • NULLs are cost free as nothing is stored
  • Versions are multiple “rows” in folded table

Key/Table Design

  • Crucial to gain best performance
    • Why do I need to know? Well, you also need to know that RDBMS is only working well when columns are indexed and query plan is OK
  • Absence of secondary indexes forces use of row key or column name sorting
  • Transfer multiple indexes into one
    • Generate large table -> Good since fits architecture and spreads across cluster

DDI

  • Stands for Denormalization, Duplication and Intelligent Keys
  • Needed to overcome shortcomings of architecture
  • Denormalization -> Replacement for JOINs
  • Duplication -> Design for reads
  • Intelligent Keys -> Implement indexing and sorting, optimize reads

Pre-materialized Everything

  • Achieving one read per customer request is possible
  • Otherwise keep at lowest number
  • Reads between 10ms(cache miss) and 1ms(cache hit)
  • Use MapReduce to compute exacts in batch

Motto: Design for Reads!!!

Tall-Narrow vs. Flat-Wide Tables

  • Rows do not split
    • Might end up with one row per region
  • Same storage footprint
  • Put more details into the row key
    • Sometimes dummy column only
    • Make use of partial key scans(见本节列表后的 Scan 示例)
  • Tall with Scans, Wide with Gets
    • Atomicity only on row level
  • Example: Large graphs, stored as adjacency matrix
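partial key scan 的典型用法如下(假设 rowkey 形如 userId#timestamp,表名 events 为假设):利用 startRow/stopRow 只扫某个前缀范围,而不是全表。

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class PartialKeyScan {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("events"))) { // 假设 rowkey 为 userId#ts
            Scan scan = new Scan();
            scan.setStartRow(Bytes.toBytes("user42#"));  // 只扫 user42 前缀的所有行
            scan.setStopRow(Bytes.toBytes("user42$"));   // '$' 是 '#' 的下一个 ASCII 字符,作为上界
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result r : scanner) {
                    System.out.println(Bytes.toString(r.getRow()));
                }
            }
        }
    }
}
```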

Hashing vs. Sequential Keys

  • Use hashes for best spread
    • Use for example MD5 to be able to recreate the key, Key = MD5(customerID)
    • Counter-productive for range scans
  • Use sequential keys for locality
    • Makes use of block caches
    • May tax one server overly, may be avoided by salting or splitting regions while keeping them small

Key Design Summary

  • Based on access pattern, either use sequential or random keys
  • Often a combination of both is needed
    • Overcome architectural limitations
  • Neither is necessarily bad
    • Use bulk import for sequential keys and reads
    • Random keys are good for random access patterns

Key Design

  • Reversed Domains
    • Examples: “com.wdxtub.www”, “cn.wdxtub.www”
    • Helps keeping pages per site close, as HBase efficiently scans blocks of sorted keys
  • Domain Row Key = MD5(Reversed Domain) + Reversed Domain(构造示例见本节列表后的代码)
    • Leading MD5 hash spreads keys randomly across all regions for load balancing reasons
    • Only hashing the domain groups per site (and per subdomain if needed)
  • URL Row Key = MD5(Reversed Domain) + Reversed Domain + URL ID
    • Unique ID per URL already available, make use of it
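按照上面的 Domain Row Key 方案,反转域名并加上 MD5 前缀的构造方式大致如下(域名仅为示例):

```java
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hbase.util.MD5Hash;

public class DomainRowKey {
    // 反转域名:www.wdxtub.com -> com.wdxtub.www,让同站点的页面在 rowkey 上相邻
    static String reverseDomain(String domain) {
        String[] parts = domain.split("\\.");
        StringBuilder sb = new StringBuilder();
        for (int i = parts.length - 1; i >= 0; i--) {
            sb.append(parts[i]);
            if (i > 0) sb.append('.');
        }
        return sb.toString();
    }

    // Domain Row Key = MD5(Reversed Domain) + Reversed Domain
    static String buildRowKey(String domain) {
        String reversed = reverseDomain(domain);
        return MD5Hash.getMD5AsHex(Bytes.toBytes(reversed)) + reversed;
    }

    public static void main(String[] args) {
        System.out.println(buildRowKey("www.wdxtub.com"));
    }
}
```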

Summary

  • Design for Use-Case
    • Read, Write, or Both?
  • Avoid Hotspotting
  • Consider using IDs instead of full text
  • Leverage Column Family to HFile relation
  • Shift details to appropriate position
    • Composite Keys
    • Column Qualifiers
  • Schema design is a combination of
    • Designing the keys (row and column)
    • Segregate data into column families
    • Choose compression and block sizes
  • Similar techniques are needed to scale most systems
    • Add indexes, partition data, consistent hashing
  • Denormalization, Duplication, and Intelligent Keys(DDI)

Cluster Sizing Notes

Competing Resources

  • Reads and Writes compete for the same low-level resources
    • Disk(HDFS) and Network I/O
    • RPC Handlers and Threads
  • Otherwise they exercise completely separate code paths

Memory Sharing

  • By default every region server divides its memory (i.e. the given maximum heap) into
    • 40% for in-memory stores (write ops)
    • 20% for block caching (read ops)
    • remaining space (here 40%) goes towards usual Java heap usage (objects etc.)
    • Share of memory needs to be tweaked

Reads

  • Locate and route request to appropriate region server
    • Client caches information for faster lookups -> consider prefetching option for fast warmups
  • Eliminate store files if possible using time ranges or Bloom filter
  • Try block cache, if block is missing then load from disk

Block Cache

  • Use exported metrics to see effectiveness of block cache
    • Check fill and eviction rate, as well as hit ratios -> random reads are not ideal
  • Tweak up or down as needed, but watch overall heap usage
  • You absolutely need the block cache
    • Set to 10% at least for short term benefits

Writes

  • The cluster size is often determined by the write performance
  • Log-structured merge-tree (LSM) style:
    • Store mutation in in-memory store and write-ahead log
    • Flush out aggregated, sorted maps at specified threshold - or - when under pressure
    • Discard logs with no pending edits
    • Perform regular compactions of store files

Write Performance

  • There are many factors to the overall write performance of a cluster
    • Key Distribution -> Avoid region hotspot
    • Handlers -> Do not pile up too early
    • Write-ahead log -> Bottleneck #1
    • Compactions -> Badly tuned can cause ever increasing background noise

Write-Ahead Log

  • Currently only one per region server
    • Shared across all stores (i.e. column families)
    • Synchronized on file append calls
  • Work being done on mitigating this
    • WAL compression
    • Multiple WAL’s per region server -> Start more than one region server per node?
  • Size set to 95% of default block size
    • 64MB or 128MB, but check config!
  • Keep number low to reduce recovery time
    • Limit set to 32, but can be increased
  • Increase size of logs - and/or - increase the number of logs before blocking
  • Compute number based on fill distribution and flush frequencies
  • Writes are synchronized across all stores
    • A large cell in one family can stop all writes of another
    • In this case the RPC handlers go binary, i.e. they either all work or all block
  • Can be bypassed on writes, but means no real durability and no replication
    • Maybe use coprocessor to restore dependent data sets(preWALRestore)

Flushes

  • Every mutation call (put, delete etc.) causes a check for a flush
  • If threshold is met, flush file to disk and schedule a compaction
    • Try to compact newly flushed files quickly
  • The compaction returns - if necessary - where a region should be split

Compaction Storms

  • Premature flushing because of # of logs or memory pressure
    • Files will be smaller than the configured flush size
  • The background compactions are hard at work merging small flush files into the existing, larger store files
    • Rewrite hundreds of MB over and over

Dependencies

  • Flushes happen across all stores/column families, even if just one triggers it
  • The flush size is compared to the size of all stores combined
    • Many column families dilute the size
    • Example: 55MB + 5MB + 4MB

Some Numbers

  • Typical Write performance of HDFS is 35-50MB/s

    Cell Size   OPS
    0.5MB       70-100
    100KB       350-500
    10KB        3500-5000 ??
    1KB         35000-50000 ????

    ?? - This is way too high in practice - contention

  • Under real world conditions the rate is less, more like 15MB/s or less
    • Thread contention is cause for massive slow down
    Cell Size   OPS
    0.5MB       10
    100KB       100
    10KB        800
    1KB         6000

Notes

  • Compute memstore sizes based on number of regions x flush size
  • Compute number of logs to keep based on fill and flush rate
  • Ultimately the capacity is driven by
    • Java Heap
    • Region Count and Size
    • Key Distribution

Cheat Sheet

  • Ensure you have enough or large enough write-ahead logs
  • Ensure you do not oversubscribe available memstore space
  • Ensure to set flush size large enough but not too large
  • Check write-ahead log usage carefully
  • Enable compression to store more data per node
  • Tweak compaction algorithm to peg background I/O at some level
  • Consider putting uneven column families in separate tables
  • Check metrics carefully for block cache, memstore, and all queues

Example

  • Java Xmx heap at 10GB
  • Memstore share at 40% (default)
    • 10GB Heap x 0.4 = 4GB
  • Desired flush size at 128MB
    • 4GB / 128MB = 32 regions max!
  • For WAL size of 128MB x 0.95
    • 4GB / (128MB x 0.95) = ~33 partially uncommitted logs to keep around
  • Region size at 20GB
    • 20GB x 32 regions = 640 GB raw storage used

参考链接

  • 官方文档
  • 与其他数据库对比
  • 原始论文
