I want to scan lots of data (range-based queries). What optimizations can I apply while writing the data so that scans become faster?

Submitted by 魔方 西西 on 2019-12-10 19:44:52

Question


I have billions of rows in HBase and I want to scan millions of rows at a time. What are the best optimization techniques I can apply to make this scan as fast as possible?


Answer 1:


We had a similar problem: we needed to scan millions of rows by key, and we used MapReduce for this. There is no standard solution, so we wrote a custom input format that extends InputFormat<ImmutableBytesWritable, Result>. Here is a short description of how we did it.

First you need to create the splits so that each key is processed on the machine hosting the region that contains it:

public List<InputSplit> getSplits(JobContext context) throws IOException {
    //read the keys to scan from the job configuration
    byte[][] filterKeys = readFilterKeys(context);

    if (table == null) {
        throw new IOException("No table was provided.");
    }

    Pair<byte[][], byte[][]> keys = table.getStartEndKeys();
    if (keys == null || keys.getFirst() == null || keys.getFirst().length == 0) {
        throw new IOException("Expecting at least one region.");
    }

    List<InputSplit> splits = new ArrayList<InputSplit>(keys.getFirst().length);
    for (int i = 0; i < keys.getFirst().length; i++) {
        //get the keys that belong to the current region;
        //they must lie between the region's start and end keys
        byte[][] regionKeys =
                getRegionKeys(keys.getFirst()[i], keys.getSecond()[i], filterKeys);
        if (regionKeys == null) {
            continue;
        }
        String regionLocation = table.getRegionLocation(keys.getFirst()[i]).
                getServerAddress().getHostname();
        //create a split for region
        InputSplit split = new MultiplyValueSplit(table.getTableName(),
                regionKeys, regionLocation);
        splits.add(split);

    }
    return splits;
}

The MultiplyValueSplit class carries the table name, the keys, and the region location for one split:

public class MultiplyValueSplit extends InputSplit
    implements Writable, Comparable<MultiplyValueSplit> {

    private byte[] tableName;
    private byte[][] keys;
    private String regionLocation;
}
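
Because the split is shipped to the task that processes it, it also has to fulfil the Writable contract and expose its region location for locality. The original answer does not show those methods; a minimal sketch, assuming Bytes is org.apache.hadoop.hbase.util.Bytes and Text is org.apache.hadoop.io.Text:

// Sketch of the serialization and locality methods for MultiplyValueSplit;
// these are assumptions, the original answer only lists the fields.
@Override
public void write(DataOutput out) throws IOException {
    Bytes.writeByteArray(out, tableName);
    out.writeInt(keys.length);
    for (byte[] key : keys) {
        Bytes.writeByteArray(out, key);
    }
    Text.writeString(out, regionLocation);
}

@Override
public void readFields(DataInput in) throws IOException {
    tableName = Bytes.readByteArray(in);
    keys = new byte[in.readInt()][];
    for (int i = 0; i < keys.length; i++) {
        keys[i] = Bytes.readByteArray(in);
    }
    regionLocation = Text.readString(in);
}

@Override
public long getLength() {
    // a rough size estimate; the framework only uses it to order splits
    return keys.length;
}

@Override
public String[] getLocations() {
    // tells the scheduler which host serves this region, enabling data locality
    return new String[] { regionLocation };
}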

In the createRecordReader method of the input format class, a MultiplyValuesReader is created; it contains the logic for reading values from the table:

@Override
public RecordReader<ImmutableBytesWritable, Result> createRecordReader(
        InputSplit split, TaskAttemptContext context) throws IOException {
    HTable table = this.getHTable();
    if (table == null) {
        throw new IOException("Cannot create a record reader because of a" +
                " previous error. Please look at the previous logs lines from" +
                " the task's full log for more details.");
    }

    MultiplyValueSplit mSplit = (MultiplyValueSplit) split;
    MultiplyValuesReader mvr = new MultiplyValuesReader();

    mvr.setKeys(mSplit.getKeys());
    mvr.setHTable(table);
    mvr.init();

    return mvr;
}

The MultiplyValuesReader class contains the logic for reading data from the HTable:

public class MultiplyValuesReader 
        extends RecordReader<ImmutableBytesWritable, Result> {
    .......

    @Override
    public ImmutableBytesWritable getCurrentKey() {
        return key;
    }

    @Override
    public Result getCurrentValue() throws IOException, InterruptedException {
        return value;
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        if (this.results == null) {
            return false;
        }

        while (this.results != null) {
            if (resultCurrentKey >= results.length) {
                this.results = getNextResults();
                continue;
            }

            if (key == null) key = new ImmutableBytesWritable();
            value = results[resultCurrentKey];
            resultCurrentKey++;

            if (value != null && value.size() > 0) {
                key.set(value.getRow());
                return true;
            }

        }
        return false;
    }

    @Override
    public float getProgress() {
        // progress is estimated from how many of the requested keys have been fetched
        return (keys.length > 0 ? ((float) currentKey) / keys.length : 0.0f);
    }

    private Result[] getNextResults() throws IOException {
        //all requested keys have already been fetched
        if (currentKey >= keys.length) {
            return null;
        }

        //fetch the next batch of rows with a multi-get, which is faster than single Gets
        ArrayList<Get> batch = new ArrayList<Get>(BATCH_SIZE);
        for (int i = currentKey; 
             i < Math.min(currentKey + BATCH_SIZE, keys.length); i++) {
            batch.add(new Get(keys[i]));
        }

        currentKey = Math.min(currentKey + BATCH_SIZE, keys.length);
        resultCurrentKey = 0;
        return htable.get(batch);
    }

}
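
The fields, setters, and init() used above are hidden behind the "......." in the class. A minimal sketch of what they could look like, consistent with the batching logic in getNextResults (BATCH_SIZE and the exact declarations are assumptions):

// Hypothetical internals of MultiplyValuesReader; the original answer elides them.
private static final int BATCH_SIZE = 1000; // assumed batch size

private HTable htable;
private byte[][] keys;            // the keys assigned to this split
private Result[] results;         // the current batch of fetched rows
private int currentKey = 0;       // index of the next key to fetch from 'keys'
private int resultCurrentKey = 0; // index into 'results'

private ImmutableBytesWritable key;
private Result value;

public void setKeys(byte[][] keys) {
    this.keys = keys;
}

public void setHTable(HTable htable) {
    this.htable = htable;
}

// Fetch the first batch so that nextKeyValue() has results to iterate over.
public void init() throws IOException {
    this.results = getNextResults();
}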

For more details, you can look at the source code of the TableInputFormat, TableInputFormatBase, TableSplit, and TableRecordReader classes.
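
To wire this into a job, the driver would look roughly like the sketch below. The class names (MultiplyValueInputFormat, RowCountMapper, MultiKeyScanJob) and the way the filter keys are passed are assumptions, not part of the original answer:

// Hypothetical job driver; class names and configuration keys are assumed.
public class MultiKeyScanJob {

    // A trivial mapper that just counts the rows delivered by the custom input format.
    static class RowCountMapper
            extends Mapper<ImmutableBytesWritable, Result, ImmutableBytesWritable, Result> {
        @Override
        protected void map(ImmutableBytesWritable key, Result value, Context context) {
            context.getCounter("scan", "rows").increment(1);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // assuming the custom format reads the table name the same way TableInputFormat does
        conf.set(TableInputFormat.INPUT_TABLE, "my_table");
        // readFilterKeys(context) would pick up the requested keys from the
        // configuration or a file in HDFS (assumption).

        Job job = new Job(conf, "multi-key scan");
        job.setJarByClass(MultiKeyScanJob.class);
        job.setInputFormatClass(MultiplyValueInputFormat.class); // the custom format above
        job.setMapperClass(RowCountMapper.class);
        job.setOutputFormatClass(NullOutputFormat.class); // results are consumed in the mapper
        job.setNumReduceTasks(0);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}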



Source: https://stackoverflow.com/questions/8427348/i-want-to-scan-lots-of-data-range-based-queries-what-all-optimizations-i-can
