HBase regions automatic splitting using hbase.hregion.max.filesize

I'm using the Cloudera distribution of HBase (hbase-0.94.6-cdh4.5.0) and Cloudera Manager to set up the whole cluster's configuration.

I have set up the following property for HBase:

<property>
<name>hbase.hregion.max.filesize</name>
<value>10737418240</value>
<source>hbase-default.xml</source>
</property>

NB: 10737418240 <=> 10G

So, according to all the documentation I have read, data should accumulate in a single region until the region size reaches 10G.

But it doesn't seem to work... Maybe I am missing something...

Here are all the regions of my HBase table and their sizes:

root@hadoopmaster01:~# hdfs dfs -du -h /hbase/my_table
719      /hbase/my_table/.tableinfo.0000000001
0        /hbase/my_table/.tmp
222.2 M  /hbase/my_table/08e225d0ae802ef805fff65c89a15de6
602.7 M  /hbase/my_table/0f3bb09af53ebdf5e538b50d7f08786e
735.1 M  /hbase/my_table/1152669b3ef439f08614e3785451c305
2.8 G    /hbase/my_table/1203fbc208fc93a702c67130047a1e4f
379.3 M  /hbase/my_table/1742b0e038ece763184829e25067f138
7.3 G    /hbase/my_table/194eae40d50554ce39c82dd8b2785d96
627.1 M  /hbase/my_table/28aa1df8140f4eb289db76a17c583028
274.6 M  /hbase/my_table/2f55b9760dbcaefca0e1064ce5da6f48
1.5 G    /hbase/my_table/392f6070132ec9505d7aaecdc1202418
1.5 G    /hbase/my_table/4396a8d8c5663de237574b967bf49b8a
1.6 G    /hbase/my_table/440964e857d9beee1c24104bd96b7d5c
1.5 G    /hbase/my_table/533369f47a365ab06f863d02c88f89e2
2.5 G    /hbase/my_table/6d86b7199c128ae891b84fd9b1ccfd6e
1.2 G    /hbase/my_table/6e5e6878028841c4d1f4c3b64d04698b
1.6 G    /hbase/my_table/7dc1c717de025f3c15aa087cda5f76d2
200.2 M  /hbase/my_table/8157d48f833bb3b708726c703874569d
118.0 M  /hbase/my_table/85fb1d24bf9d03d748f615d3907589f2
2.0 G    /hbase/my_table/94dd01c81c73dc35c02b6bd2c17d8d22
265.1 M  /hbase/my_table/990d5adb14b2d1c936bd4a9c726f8e03
335.0 M  /hbase/my_table/a9b673c142346014e01d7cf579b0e58a
502.1 M  /hbase/my_table/ae3b1f6f537826f1bdb31bfc89d8ff9a
763.3 M  /hbase/my_table/b6039c539b6cca2826022f863ed76c7b
470.7 M  /hbase/my_table/be091ead2a408df55999950dcff6e7bc
5.9 G    /hbase/my_table/c176cf8c19cc0fffab2af63ee7d1ca45
512.0 M  /hbase/my_table/cb622a8a55ba575549759514281d5841
1.9 G    /hbase/my_table/d201d1630ffdf08e4114dfc691488372
787.9 M  /hbase/my_table/d78b4f682bb8e666488b06d0fd00ef9b
862.8 M  /hbase/my_table/edd72e02de2a90aab086acd296d7da2b
627.5 M  /hbase/my_table/f13a251ff7154f522e47bd54f0d1f921
1.3 G    /hbase/my_table/fde68ec48d68e7f61a0258b7f8898be4

As you can see, there are a lot of regions, and none of them has a size close to 10G...

If someone has faced this kind of issue, or knows whether there is another configuration setting I need, please help me!

Thx

haydenmarchant

@mpiffaretti, what you are seeing is expected behaviour. I also got a little shock when I first saw the region sizes after an automatic split.

In HBase 0.94+, the default split policy is IncreasingToUpperBoundRegionSplitPolicy. The region size is decided by following the algorithm described below.

Split size is the number of regions that are on this server that all are of the same table, cubed, times 2x the region flush size OR the maximum region split size, whichever is smaller. For example, if the flush size is 128M, then after two flushes (256MB) we will split which will make two regions that will split when their size is 2^3 * 128M*2 = 2048M. If one of these regions splits, then there are three regions and now the split size is 3^3 * 128M*2 = 6912M, and so on until we reach the configured maximum filesize and then from there on out, we'll use that.
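To make the arithmetic in the rule quoted above concrete, here is a minimal sketch in Python. It is not HBase code: the function name `split_size`, the 128M default flush size, and the guard for a region count of zero are assumptions made for illustration. Only the formula itself (region count cubed, times 2x the flush size, capped at the maximum file size) comes from the description above.

```python
MB = 1024 * 1024

def split_size(region_count, flush_size=128 * MB, max_filesize=10 * 1024 * MB):
    """Split threshold in bytes, following the rule quoted above.

    region_count: regions of this table hosted on the same region server.
    flush_size:   memstore flush size (128M is the example value from the text).
    max_filesize: hbase.hregion.max.filesize (10G in the question).
    """
    if region_count <= 0:
        # Assumption for this sketch: with no regions counted, fall back to the cap.
        return max_filesize
    return min(max_filesize, region_count ** 3 * 2 * flush_size)

# Print how the threshold grows with the number of regions on one server.
for n in range(1, 5):
    print(n, "region(s):", split_size(n) // MB, "MB")
```

With these values the threshold is 256 MB for one region, 2048 MB for two, 6912 MB for three, and it is capped at 10240 MB from four regions onward, which is why regions split long before they reach 10G.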

This is quite a nice strategy since you start to get a nice spread of regions over the region servers without having to wait until they reach the 10GB limit.

Alternatively, you would be better off pre-splitting your tables, since you want to make sure that you are getting the most out of the processing power of your cluster. If you have a single region, all requests will go to the region server to which that region is assigned. Pre-splitting puts control in your hands over how the regions are split across the row-key space.
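As a rough illustration of pre-splitting, the sketch below generates evenly spaced split points and prints an HBase shell `create` command that uses the `SPLITS` option. It assumes row keys start with a fixed-width hex prefix (for example, a hash of the natural key), which may not match your schema. The helper names and the column family `cf` are hypothetical; `my_table` is the table name from the question.

```python
def hex_split_points(num_regions, width=2):
    """Return num_regions - 1 evenly spaced hex prefixes of the given width."""
    total = 16 ** width
    return [format(i * total // num_regions, f"0{width}x")
            for i in range(1, num_regions)]

def create_command(table, family, splits):
    """Build an HBase shell 'create' statement with explicit split points."""
    quoted = ", ".join(f"'{s}'" for s in splits)
    return f"create '{table}', '{family}', SPLITS => [{quoted}]"

# 'cf' is a placeholder column family name, not taken from the question.
print(create_command("my_table", "cf", hex_split_points(4)))
```

The printed statement can be pasted into the HBase shell. The right number of regions and the right key boundaries depend on your actual row-key distribution, so treat the even spacing here as a starting point only.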

Pre-splitting is the better option. Hopefully your data is not being inserted continuously into a single region, which then splits or compacts once it reaches the region size limit.

In that situation, writes are not uniformly distributed, and compaction of the table becomes a bottleneck for the writing modules.

The number of requests on the active region will be high.

Source: https://stackoverflow.com/questions/23872556/hbase-regions-automatic-splitting-using-hbase-hregion-max-filesize