Cassandra Map Reduce for TimeUUID columns

Submitted by 两盒软妹~` on 2019-12-11 05:54:02

Question


I recently set up a 4-node Cassandra cluster for learning, with one column family that holds time series data as:

Key -> {column name: timeUUID, column value: csv log line, ttl: 1 year}. I use the Netflix Astyanax Java client to load about 1 million log lines.
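For background, here is a minimal sketch of how such columns can be written with Astyanax. The column family name ("logs") and the LogWriter wrapper class are illustrative assumptions; only the timeUUID column names and the one-year TTL come from the setup described above.

import java.util.UUID;

import com.netflix.astyanax.Keyspace;
import com.netflix.astyanax.MutationBatch;
import com.netflix.astyanax.connectionpool.exceptions.ConnectionException;
import com.netflix.astyanax.model.ColumnFamily;
import com.netflix.astyanax.serializers.StringSerializer;
import com.netflix.astyanax.serializers.TimeUUIDSerializer;
import com.netflix.astyanax.util.TimeUUIDUtils;

public class LogWriter {
    // Row key is a String; column names are timeUUIDs ("logs" is an assumed CF name).
    private static final ColumnFamily<String, UUID> CF_LOGS =
            new ColumnFamily<String, UUID>("logs",
                    StringSerializer.get(), TimeUUIDSerializer.get());

    private static final int ONE_YEAR_SECONDS = 60 * 60 * 24 * 365;

    public static void writeLogLine(Keyspace keyspace, String rowKey, String csvLine)
            throws ConnectionException {
        MutationBatch m = keyspace.prepareMutationBatch();
        // One column per log line: name = unique timeUUID, value = csv line, ttl = 1 year.
        m.withRow(CF_LOGS, rowKey)
                .putColumn(TimeUUIDUtils.getUniqueTimeUUIDinMillis(), csvLine, ONE_YEAR_SECONDS);
        m.execute();
    }
}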

I also configured Hadoop with 1 namenode and 4 datanodes to run map-reduce analytics on the Cassandra data.

All the examples available on the internet use literal column names in the SlicePredicate for the Hadoop job configuration, whereas my column names are timeUUIDs. How can I efficiently feed the Cassandra data to the Hadoop job in batches of 1000 columns at a time?

Some rows in this test data have more than 10000 columns, and real data is expected to have even more.


I configure my job as:

public int run(String[] arg0) throws Exception {
    Job job = new Job(getConf(), JOB_NAME);
    job.setJarByClass(LogTypeCounterByDate.class);
    job.setMapperClass(LogTypeCounterByDateMapper.class);
    job.setReducerClass(LogTypeCounterByDateReducer.class);

    job.setInputFormatClass(ColumnFamilyInputFormat.class);

    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(LongWritable.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);

    ConfigHelper.setRangeBatchSize(getConf(), 1000);

    // Empty start/end selects all columns; reversed = true returns the newest
    // first, but count caps each row at 1000 columns.
    SliceRange sliceRange = new SliceRange(ByteBuffer.wrap(new byte[0]),
            ByteBuffer.wrap(new byte[0]), true, 1000);

    SlicePredicate slicePredicate = new SlicePredicate();
    slicePredicate.setSlice_range(sliceRange);

    ConfigHelper.setInputColumnFamily(job.getConfiguration(), KEYSPACE, COLUMN_FAMILY);
    ConfigHelper.setInputRpcPort(job.getConfiguration(), INPUT_RPC_PORT);
    ConfigHelper.setInputInitialAddress(job.getConfiguration(), INPUT_INITIAL_ADRESS);
    ConfigHelper.setInputPartitioner(job.getConfiguration(), INPUT_PARTITIONER);
    ConfigHelper.setInputSlicePredicate(job.getConfiguration(), slicePredicate);
    FileOutputFormat.setOutputPath(job, new Path(OUTPUT_PATH));

    job.waitForCompletion(true);
    return job.isSuccessful() ? 0 : 1;
}

But I'm not able to understand how to define the Mapper; could you kindly provide a template for the Mapper class?

public static class LogTypeCounterByDateMapper extends Mapper<ByteBuffer, SortedMap<ByteBuffer, IColumn>, Text, LongWritable>
{
    private Text key = null;
    private LongWritable value = null;

    @Override
    protected void setup(Context context){

    }

    public void map(ByteBuffer key, SortedMap<ByteBuffer, IColumn> columns, Context context){
        //String[] lines = columns.;

    }
}
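For completeness, here is a minimal sketch of what such a mapper could look like. The per-type-per-day counting logic and the assumption that the log type is the first csv field are illustrative guesses based on the class name, not a confirmed implementation; UUIDGen, ByteBufferUtil, and TimeUUIDUtils are the usual Cassandra/Astyanax utility classes for decoding the column name and value.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.SortedMap;
import java.util.UUID;

import org.apache.cassandra.db.IColumn;
import org.apache.cassandra.utils.ByteBufferUtil;
import org.apache.cassandra.utils.UUIDGen;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import com.netflix.astyanax.util.TimeUUIDUtils;

public static class LogTypeCounterByDateMapper
        extends Mapper<ByteBuffer, SortedMap<ByteBuffer, IColumn>, Text, LongWritable> {

    private static final LongWritable ONE = new LongWritable(1L);
    private final SimpleDateFormat dayFormat = new SimpleDateFormat("yyyy-MM-dd");

    @Override
    protected void map(ByteBuffer key, SortedMap<ByteBuffer, IColumn> columns, Context context)
            throws IOException, InterruptedException {
        for (IColumn column : columns.values()) {
            // The column name is a timeUUID; recover the wall-clock time from it.
            UUID timeUUID = UUIDGen.getUUID(column.name());
            long millis = TimeUUIDUtils.getTimeFromUUID(timeUUID);

            // The column value is the raw csv log line.
            String logLine = ByteBufferUtil.string(column.value());

            // Illustrative aggregation: count log lines per (type, day),
            // assuming the log type is the first csv field.
            String logType = logLine.split(",")[0];
            context.write(new Text(logType + ":" + dayFormat.format(new Date(millis))), ONE);
        }
    }
}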

Answer 1:


ConfigHelper.setRangeBatchSize(getConf(), 1000);
...
SlicePredicate predicate = new SlicePredicate().setSlice_range(
        new SliceRange(TimeUUID.asByteBuffer(startValue),
                TimeUUID.asByteBuffer(endValue), false, 1000));
ConfigHelper.setInputSlicePredicate(conf, predicate);
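A caveat worth noting beyond the answer above: the count field of a SliceRange still caps each row at its first 1000 matching columns, and setRangeBatchSize controls how many rows are pulled per Thrift call, not how many columns. On Cassandra 1.1 or later, ColumnFamilyInputFormat can page through wide rows when the widerows flag is enabled on setInputColumnFamily; a sketch, assuming that overload exists in your version:

// With widerows enabled, the record reader pages through each row's columns
// instead of stopping at the slice count, so rows with more than 10000
// columns are consumed in full (a single row may then be delivered across
// multiple map() calls).
ConfigHelper.setInputColumnFamily(job.getConfiguration(), KEYSPACE, COLUMN_FAMILY, true);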


Source: https://stackoverflow.com/questions/13330960/cassandra-map-reduce-for-timeuuid-columns
