Question
I am working on a batch job to process a batch of Put objects into HBase through HTableInterface. There are two API methods, HTableInterface.put(List) and HTableInterface.put(Put).
I am wondering, for the same number of Put objects, is the batch put faster than putting them one by one?
Another question: I am putting a very large Put object, which causes the job to fail. There seems to be a limit on the size of a Put object. How large can it be?
Answer 1:
put(List<Put> puts) and put(Put aPut) are the same under the hood. They both call doPut(List<Put> puts).
What matters is the write buffer size, as mentioned by @ozhang. The default value is 2 MB:
<property>
<name>hbase.client.write.buffer</name>
<value>2097152</value>
</property>
There will be 1 RPC every time the write buffer fills up and a flushCommits() is triggered. So if your application is flushing too often because your objects are relatively big, experimenting with a larger write buffer may solve the problem.
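To get a feel for how the buffer size affects the number of flushes, here is a rough sketch of the arithmetic. It is a simplified model and not HBase's actual code: the class and method names are hypothetical, it assumes every Put has the same size, it assumes a flush happens as soon as the buffered bytes exceed the limit, and it ignores the per-object overhead that the real client adds to each Put.

```java
/** Simplified model of a client-side write buffer; not HBase's implementation. */
final class WriteBufferModel {

    /**
     * Counts flushes for putCount equally sized puts.
     * Assumption: a flush is triggered once the buffered bytes exceed the limit,
     * and one final explicit flush sends whatever is left in the buffer.
     */
    static int countFlushes(int putCount, long putSizeBytes, long bufferLimitBytes) {
        long buffered = 0;
        int flushes = 0;
        for (int i = 0; i < putCount; i++) {
            buffered += putSizeBytes;
            if (buffered > bufferLimitBytes) {
                flushes++;      // buffer overflowed: flush and start over
                buffered = 0;
            }
        }
        if (buffered > 0) {
            flushes++;          // final flush for the remainder
        }
        return flushes;
    }

    public static void main(String[] args) {
        long halfMegabyte = 512L * 1024;
        // Default 2 MB buffer versus a 20 MB buffer, ten puts of 512 KB each.
        System.out.println(countFlushes(10, halfMegabyte, 2097152L));
        System.out.println(countFlushes(10, halfMegabyte, 20971520L));
    }
}
```

Under these assumptions, ten 512 KB puts cause two flushes with the 2 MB default and a single flush with a 20 MB buffer.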
Answer 2:
If your key-value size is large, then using a list of puts may run into the client-side write buffer limit.
<property>
<name>hbase.client.write.buffer</name>
<value>20971520</value>
</property>
The client collects up to 2 MB of data by default and then flushes it, so you may also have to increase this value (the setting above raises it to 20 MB).
Answer 3:
For batch puts, it is better to construct a list of puts and then call HTableInterface.put(List<Put> puts), because it uses a single RPC call to commit the batch. Depending on the size of the list, however, the write buffer may or may not flush it all at once.
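One way to keep each list at a predictable size is to split the work into fixed-size batches before handing them to the table. The helper below is only a sketch: PutBatcher and partition are hypothetical names, the helper is generic so it runs without the HBase client on the classpath, and the batch size is something you would have to tune yourself.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that splits a list into fixed-size batches. */
final class PutBatcher {

    /** Returns consecutive sublists of at most batchSize elements, in order. */
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < items.size(); start += batchSize) {
            int end = Math.min(start + batchSize, items.size());
            // Copy the view so each batch is independent of the source list.
            batches.add(new ArrayList<>(items.subList(start, end)));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Integer> items = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            items.add(i);
        }
        // In a real job each batch would be a List<Put> passed to
        // HTableInterface.put(List<Put>) instead of being printed.
        for (List<Integer> batch : partition(items, 4)) {
            System.out.println(batch);
        }
    }
}
```

In the actual job, the elements would be Put objects and each batch would go to HTableInterface.put(List<Put> puts).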
Answer 4:
You will definitely save the overhead of multiple RPC requests versus one by using the put(List puts) method.
About the very large Put object: by default, the maximum KeyValue size is limited to 10 MB. I think you have to increase that limit to store bigger KeyValue objects.
hbase.client.keyvalue.maxsize
Specifies the combined maximum allowed size of a KeyValue instance. This is to set an upper boundary for a single entry saved in a storage file. Since they cannot be split it helps avoiding that a region cannot be split any further because the data is too large. It seems wise to set this to a fraction of the maximum region size. Setting it to zero or less disables the check.
Default: 10485760
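The rule described above can be sketched as a small check. This is only an illustration of the documented behavior, not the client's real validation code: KeyValueSizeCheck and isAllowed are hypothetical names, and the size passed in is assumed to be the combined size of one KeyValue as the description defines it.

```java
/** Illustrative model of the hbase.client.keyvalue.maxsize rule. */
final class KeyValueSizeCheck {

    /** Default from the property description: 10485760 bytes (10 MB). */
    static final long DEFAULT_MAX_SIZE = 10485760L;

    /**
     * Returns true if a KeyValue of the given combined size passes the check.
     * A configured maximum of zero or less disables the check entirely.
     */
    static boolean isAllowed(long keyValueSizeBytes, long maxSizeBytes) {
        if (maxSizeBytes <= 0) {
            return true; // check disabled
        }
        return keyValueSizeBytes <= maxSizeBytes;
    }

    public static void main(String[] args) {
        long fifteenMegabytes = 15L * 1024 * 1024;
        System.out.println(isAllowed(fifteenMegabytes, DEFAULT_MAX_SIZE)); // rejected by default
        System.out.println(isAllowed(fifteenMegabytes, 0L));               // check disabled
    }
}
```

So a 15 MB value would be rejected under the default and accepted once the limit is raised above its size or disabled.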
Answer 5:
Please note that the following statement is outdated:
"put(List<Put> puts) or put(Put aPut) are the same under the hood. They both call doPut(List<Put> puts)."
There is a new implementation now:
org.apache.hadoop.hbase.client.AsyncRequestFutureImpl.SingleServerRequestRunnable#run
MultiServerCallable
Callable that handles the multi method call going against a single regionserver
So I think the answer for your first question is yes.
I will verify it with a benchmark test sometime.
Source: https://stackoverflow.com/questions/28754077/is-hbase-batch-put-putlistput-faster-than-putput-what-is-the-capacity-of