Question
I have billions of rows in HBase and I want to scan a million rows at a time. What are the best optimization techniques to make this scan as fast as possible?
Answer 1:
We had a similar problem: we needed to scan millions of rows by key, and we used MapReduce for this. There is no standard solution, so we wrote a custom input format that extends InputFormat<ImmutableBytesWritable, Result>. Here is a short description of how we did it.
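The answer never shows the outer class declaration. A minimal hypothetical skeleton, here called MultiplyValueInputFormat, with an HTable field that the getHTable() calls below rely on (both the name and the field are assumptions, not the original code):

//hypothetical skeleton of the custom input format described below;
//the class name and the table field are assumptions, not the original code
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.mapreduce.InputFormat;

public class MultiplyValueInputFormat
    extends InputFormat<ImmutableBytesWritable, Result> {

  //the table to scan, initialized from the job configuration
  private HTable table;

  protected HTable getHTable() {
    return table;
  }

  //getSplits(...) and createRecordReader(...) are shown below
}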
First you need to create the splits, so that each key goes to the machine hosting the region that contains it:
public List<InputSplit> getSplits(JobContext context) throws IOException {
  //read the keys to scan from the job context
  byte[][] filterKeys = readFilterKeys(context);
  if (table == null) {
    throw new IOException("No table was provided.");
  }
  Pair<byte[][], byte[][]> keys = table.getStartEndKeys();
  if (keys == null || keys.getFirst() == null || keys.getFirst().length == 0) {
    throw new IOException("Expecting at least one region.");
  }
  List<InputSplit> splits = new ArrayList<InputSplit>(keys.getFirst().length);
  for (int i = 0; i < keys.getFirst().length; i++) {
    //get the keys for the current region;
    //they must lie between the region's start and end keys
    byte[][] regionKeys =
        getRegionKeys(keys.getFirst()[i], keys.getSecond()[i], filterKeys);
    if (regionKeys == null) {
      continue;
    }
    String regionLocation = table.getRegionLocation(keys.getFirst()[i])
        .getServerAddress().getHostname();
    //create a split for the region
    InputSplit split = new MultiplyValueSplit(table.getTableName(),
        regionKeys, regionLocation);
    splits.add(split);
  }
  return splits;
}
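The helpers readFilterKeys and getRegionKeys are not shown in the answer. readFilterKeys presumably loads the requested keys from the job configuration or the distributed cache; getRegionKeys could plausibly look like this (a sketch under that assumption, using org.apache.hadoop.hbase.util.Bytes, not the original code):

//hypothetical implementation of getRegionKeys: keep only the filter keys
//that fall inside the region's [startKey, endKey) range
private byte[][] getRegionKeys(byte[] startKey, byte[] endKey,
                               byte[][] filterKeys) {
  List<byte[]> inRegion = new ArrayList<byte[]>();
  for (byte[] key : filterKeys) {
    boolean afterStart = Bytes.compareTo(key, startKey) >= 0;
    //an empty end key marks the last region, which is unbounded above
    boolean beforeEnd = endKey.length == 0
        || Bytes.compareTo(key, endKey) < 0;
    if (afterStart && beforeEnd) {
      inRegion.add(key);
    }
  }
  //returning null makes getSplits skip regions with no matching keys
  return inRegion.isEmpty() ? null : inRegion.toArray(new byte[0][]);
}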
The class 'MultiplyValueSplit' contains information about the keys and the table:
public class MultiplyValueSplit extends InputSplit
    implements Writable, Comparable<MultiplyValueSplit> {

  private byte[] tableName;
  private byte[][] keys;
  private String regionLocation;
}
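Since the split implements Writable and extends InputSplit, it also needs serialization and location methods so the framework can ship it to the right machine. The answer elides them; a minimal sketch of plausible bodies (assumptions, using org.apache.hadoop.hbase.util.Bytes and org.apache.hadoop.io.Text):

//hedged sketch of the Writable/InputSplit plumbing the split needs
@Override
public void write(DataOutput out) throws IOException {
  Bytes.writeByteArray(out, tableName);
  out.writeInt(keys.length);
  for (byte[] key : keys) {
    Bytes.writeByteArray(out, key);
  }
  Text.writeString(out, regionLocation);
}

@Override
public void readFields(DataInput in) throws IOException {
  tableName = Bytes.readByteArray(in);
  keys = new byte[in.readInt()][];
  for (int i = 0; i < keys.length; i++) {
    keys[i] = Bytes.readByteArray(in);
  }
  regionLocation = Text.readString(in);
}

@Override
public long getLength() {
  //rough size estimate: the number of keys assigned to this split
  return keys.length;
}

@Override
public String[] getLocations() {
  //locality hint: prefer running on the host serving the region
  return new String[] { regionLocation };
}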
In the input format's createRecordReader method, a 'MultiplyValuesReader' is created; it contains the logic for reading values from the table:
@Override
public RecordReader<ImmutableBytesWritable, Result> createRecordReader(
    InputSplit split, TaskAttemptContext context) throws IOException {
  HTable table = this.getHTable();
  if (table == null) {
    throw new IOException("Cannot create a record reader because of a" +
        " previous error. Please look at the previous logs lines from" +
        " the task's full log for more details.");
  }
  MultiplyValueSplit mSplit = (MultiplyValueSplit) split;
  MultiplyValuesReader mvr = new MultiplyValuesReader();
  mvr.setKeys(mSplit.getKeys());
  mvr.setHTable(table);
  mvr.init();
  return mvr;
}
The class 'MultiplyValuesReader' contains the logic for reading data from the HTable:
public class MultiplyValuesReader
    extends RecordReader<ImmutableBytesWritable, Result> {
  .......

  @Override
  public ImmutableBytesWritable getCurrentKey() {
    return key;
  }

  @Override
  public Result getCurrentValue() throws IOException, InterruptedException {
    return value;
  }

  @Override
  public boolean nextKeyValue() throws IOException, InterruptedException {
    if (this.results == null) {
      return false;
    }
    while (this.results != null) {
      //current batch exhausted: fetch the next one
      if (resultCurrentKey >= results.length) {
        this.results = getNextResults();
        continue;
      }
      if (key == null) key = new ImmutableBytesWritable();
      value = results[resultCurrentKey];
      resultCurrentKey++;
      //skip empty results (keys that were not found)
      if (value != null && value.size() > 0) {
        key.set(value.getRow());
        return true;
      }
    }
    return false;
  }

  public float getProgress() {
    //depends on the total number of keys
    return (keys.length > 0 ? ((float) currentKey) / keys.length : 0.0f);
  }

  private Result[] getNextResults() throws IOException {
    //the condition must be >= (not <= as originally posted),
    //otherwise no batch is ever fetched
    if (currentKey >= keys.length) {
      return null;
    }
    //using a batch of Gets for a faster scan
    ArrayList<Get> batch = new ArrayList<Get>(BATCH_SIZE);
    for (int i = currentKey;
         i < Math.min(currentKey + BATCH_SIZE, keys.length); i++) {
      batch.add(new Get(keys[i]));
    }
    currentKey = Math.min(currentKey + BATCH_SIZE, keys.length);
    resultCurrentKey = 0;
    return htable.get(batch);
  }
}
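The elided part of the reader (the '.......') holds its state and initialization. A plausible sketch with field names matching their usage above (the values and bodies are assumptions, not the original code):

//hypothetical sketch of the reader's elided fields and init()
private static final int BATCH_SIZE = 1000; //rows fetched per multi-get

private HTable htable;
private byte[][] keys;            //the keys assigned to this split
private int currentKey = 0;       //index of the next key to batch
private int resultCurrentKey = 0; //cursor inside the current results array
private Result[] results;
private ImmutableBytesWritable key;
private Result value;

public void setKeys(byte[][] keys) {
  this.keys = keys;
}

public void setHTable(HTable htable) {
  this.htable = htable;
}

public void init() throws IOException {
  //prime the reader with the first batch so nextKeyValue() has data
  this.results = getNextResults();
}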
For more details you can look at the source code of the classes TableInputFormat, TableInputFormatBase, TableSplit, and TableRecordReader.
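To tie it together, the input format is wired into a map-only job. A minimal driver sketch (MultiplyValueInputFormat and the mapper are hypothetical names, since the answer never shows this part):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class MultiKeyScanJob {

  //mapper receiving one (row key, Result) pair per fetched row
  static class ScanMapper
      extends Mapper<ImmutableBytesWritable, Result, NullWritable, NullWritable> {
    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context context) {
      //process one fetched row here
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "multi-key-scan");
    job.setJarByClass(MultiKeyScanJob.class);
    //the custom input format computes one split per region (see getSplits above)
    job.setInputFormatClass(MultiplyValueInputFormat.class);
    job.setMapperClass(ScanMapper.class);
    job.setNumReduceTasks(0); //map-only scan
    job.setOutputFormatClass(NullOutputFormat.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}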
Source: https://stackoverflow.com/questions/8427348/i-want-to-scan-lots-of-data-range-based-queries-what-all-optimizations-i-can