Is HBase batch put put(List<Put>) faster than put(Put)? What is the capacity of a Put object?

Posted by 十年热恋 on 2020-01-02 03:26:10

Question


I am working on a batch job that writes a batch of Put objects into HBase through HTableInterface. There are two API methods, HTableInterface.put(List<Put>) and HTableInterface.put(Put).

I am wondering, for the same number of Put objects, is the batch put faster than putting them one by one?

My other question: I am putting a very large Put object, which causes the job to fail. There seems to be a limit on the size of a Put object. How large can it be?


Answer 1:


put(List<Put> puts) and put(Put aPut) are the same under the hood. They both call doPut(List<Put> puts).

What matters is the client write buffer size, as mentioned by @ozhang. The default value is 2 MB:

<property>   
     <name>hbase.client.write.buffer</name>
     <value>2097152</value> 
</property>

There will be one RPC every time the write buffer fills up and a flushCommits() is triggered. So if your application is flushing too often because your objects are relatively big, experimenting with a larger write buffer should help.
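To get a feel for how the buffer size drives the number of flushes, here is a minimal sketch in plain Java. It does not use the HBase client: `WriteBufferModel` and `countFlushes` are hypothetical names, and the model assumes equally sized puts, a flush whenever the buffered bytes exceed the configured size, and one final flush for whatever is left. Like the answer above, it counts each flush as a single round trip.

```java
// Simplified model of a client-side write buffer (not the HBase API).
final class WriteBufferModel {

    // Returns how many flushes are needed to send putCount puts of
    // putSizeBytes each through a buffer of writeBufferBytes.
    static int countFlushes(int putCount, long putSizeBytes, long writeBufferBytes) {
        int flushes = 0;
        long buffered = 0;
        for (int i = 0; i < putCount; i++) {
            buffered += putSizeBytes;
            // Assumption: a flush is triggered once the buffer is exceeded.
            if (buffered > writeBufferBytes) {
                flushes++;
                buffered = 0;
            }
        }
        // Whatever remains is sent by an explicit final flush.
        if (buffered > 0) {
            flushes++;
        }
        return flushes;
    }

    public static void main(String[] args) {
        int puts = 1000;
        long putSize = 10L * 1024;                 // 10 KB per put
        long defaultBuffer = 2L * 1024 * 1024;     // 2097152
        long largerBuffer = 20L * 1024 * 1024;     // 20971520

        System.out.println(countFlushes(puts, putSize, defaultBuffer)); // 5
        System.out.println(countFlushes(puts, putSize, largerBuffer));  // 1
    }
}
```

With these assumed numbers, 1000 puts of 10 KB need five flushes at the default 2 MB buffer and one flush at 20 MB.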




Answer 2:


If your KeyValue size is large, then using a list of puts may run into the client-side write buffer size.

<property>   
    <name>hbase.client.write.buffer</name>
    <value>20971520</value> 
</property>

The client collects up to 2 MB of data by default and then flushes it, so you may also have to increase this value (the example above sets it to 20 MB).




Answer 3:


For batch puts it is better to construct a list of puts and then call HTableInterface.put(List<Put> puts), because it uses a single RPC call to commit the batch. Depending on the size of the list, though, the write buffer may or may not flush all of it at once.




Answer 4:


You will definitely save on the overhead of multiple RPC requests versus one by using the put(List<Put> puts) method.

About the very large Put object: by default, the maximum KeyValue size is 10 MB. I think you have to increase that limit to store bigger KeyValue objects.

hbase.client.keyvalue.maxsize

Specifies the combined maximum allowed size of a KeyValue instance. This is to set an upper boundary for a single entry saved in a storage file. Since they cannot be split it helps avoiding that a region cannot be split any further because the data is too large. It seems wise to set this to a fraction of the maximum region size. Setting it to zero or less disables the check.

Default: 10485760
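The rule in that description can be sketched in a few lines of plain Java. This is only an illustration of the documented behaviour (a positive limit is enforced, zero or less disables the check), not the HBase client code; `KeyValueSizeCheck` and `isAllowed` are hypothetical names.

```java
// Illustrative model of the hbase.client.keyvalue.maxsize rule (not the HBase API).
final class KeyValueSizeCheck {

    // Documented default: 10485760 bytes (10 MB).
    static final long DEFAULT_MAX_KEYVALUE_SIZE = 10L * 1024 * 1024;

    // Returns true if a KeyValue of the given size passes the limit.
    static boolean isAllowed(long keyValueSizeBytes, long maxSizeBytes) {
        // A limit of zero or less disables the check entirely.
        if (maxSizeBytes <= 0) {
            return true;
        }
        return keyValueSizeBytes <= maxSizeBytes;
    }

    public static void main(String[] args) {
        long fiftyMb = 50L * 1024 * 1024;
        System.out.println(isAllowed(fiftyMb, DEFAULT_MAX_KEYVALUE_SIZE)); // false
        System.out.println(isAllowed(fiftyMb, 64L * 1024 * 1024));         // true
        System.out.println(isAllowed(fiftyMb, 0));                         // true
    }
}
```

Note that the limit applies to each KeyValue (a single cell), so a Put with many small cells is not rejected by this check just because its total size is large.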




Answer 5:


Please note that the following statement is outdated:

"put(List<Put> puts) and put(Put aPut) are the same under the hood. They both call doPut(List<Put> puts)."

There is a new implementation now. See org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.SingleServerRequestRunnable#run and MultiServerCallable, which is described as:

Callable that handles the multi method call going against a single regionserver

So I think the answer to your first question is yes.

I will verify it with a benchmark sometime.
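As a rough illustration of why a per-server multi call favours batching, the sketch below counts round trips in a simplified model: each single put sent on its own costs one call, while a batch costs one call per distinct region server it touches. It is plain Java rather than HBase code; `RpcCountModel`, its methods, and the row-to-server map are hypothetical, and write buffering and retries are ignored.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Simplified round-trip model (not the HBase API).
final class RpcCountModel {

    // One call per put when every put is sent on its own.
    static int rpcsForSinglePuts(List<String> rowKeys) {
        return rowKeys.size();
    }

    // One call per distinct region server when puts are grouped by server.
    static int rpcsForBatchPut(List<String> rowKeys, Map<String, String> rowToServer) {
        Set<String> servers = new HashSet<>();
        for (String row : rowKeys) {
            servers.add(rowToServer.get(row));
        }
        return servers.size();
    }

    public static void main(String[] args) {
        // Assumed layout: four rows spread over two region servers.
        Map<String, String> rowToServer = new HashMap<>();
        rowToServer.put("row-a", "rs1");
        rowToServer.put("row-b", "rs1");
        rowToServer.put("row-c", "rs2");
        rowToServer.put("row-d", "rs2");
        List<String> rows = Arrays.asList("row-a", "row-b", "row-c", "row-d");

        System.out.println(rpcsForSinglePuts(rows));            // 4
        System.out.println(rpcsForBatchPut(rows, rowToServer)); // 2
    }
}
```

This is only a counting argument; it does not replace an actual benchmark against a cluster.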



Source: https://stackoverflow.com/questions/28754077/is-hbase-batch-put-putlistput-faster-than-putput-what-is-the-capacity-of
