Hadoop HDFS MapReduce output into MongoDB

You want the «MongoDB Connector for Hadoop». See its examples.

It's tempting to just add code in your Reducer that, as a side effect, inserts data into your database. Avoid this temptation. One reason to use a connector as opposed to just inserting data as a side effect of your reducer class is speculative execution: Hadoop can sometimes run two of the exact same reduce tasks in parallel, which can lead to extraneous inserts and duplicate data.
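
To make the risk concrete, here is a minimal sketch of the pattern to avoid, assuming the mongo-java-driver is on the classpath; the class, host, database, and collection names are all hypothetical:

    import java.io.IOException;
    import java.util.Iterator;
    
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;
    
    import com.mongodb.BasicDBObject;
    import com.mongodb.MongoClient;
    
    // Anti-pattern: inserting into MongoDB as a side effect of reduce().
    // If Hadoop speculatively launches a second attempt of this task,
    // both attempts perform the inserts and the collection gets duplicates.
    public class SideEffectReducer extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
    
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            // The insert below is invisible to the framework: it is not part of the
            // task's committed output, so a duplicated or killed attempt is not rolled back.
            MongoClient client = new MongoClient("localhost", 27017);
            client.getDB("mydb").getCollection("results")
                  .insert(new BasicDBObject("word", key.toString()).append("count", sum));
            client.close();
            output.collect(key, new IntWritable(sum));
        }
    }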

Yes. You write to MongoDB as usual. The fact that your MongoDB deployment is sharded is a detail that is hidden from you.
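
As a minimal sketch (the mongos host and the database/collection names here are hypothetical), the only place the deployment shows up is the URI you hand to the connector, typically pointing at a mongos router in front of the sharded cluster:

    // A hypothetical mongos router address; the rest of the job configuration is unchanged.
    conf.set("mongo.output.uri", "mongodb://my-mongos-host:27017/mydbName.myCollectionName");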

I spent my morning implementing the same scenario. Here is my solution:

Create three classes:

  • Experiment.java: for job configuration and submission
  • MyMapper.java: the mapper class (a minimal sketch is given after the driver code below)
  • MyReducer.java: the reducer class (a minimal sketch is given after the driver code below)

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;
    
    import com.mongodb.hadoop.io.BSONWritable;
    import com.mongodb.hadoop.mapred.MongoOutputFormat;
    
    public class Experiment extends Configured implements Tool{
    
         public int run(final String[] args) throws Exception {
            final Configuration conf = getConf();
            conf.set("mongo.output.uri", args[1]);
    
            final JobConf job = new JobConf(conf);
    
            FileInputFormat.setInputPaths(job, new Path(args[0]));
            job.setJarByClass(Experiment.class);
    
            job.setInputFormat(org.apache.hadoop.mapred.TextInputFormat.class);
            job.setMapperClass(MyMapper.class);
            job.setReducerClass(MyReducer.class);
            job.setOutputFormat(MongoOutputFormat.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(BSONWritable.class);
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(IntWritable.class);
    
            JobClient.runJob(job);
    
            return 0;
        }
    
        public static void main(final String[] args) throws Exception{
    
            int res = ToolRunner.run(new Experiment(), args);
            System.exit(res);
        }
    }
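
The answer only shows the driver; the mapper and reducer it references are not included. Below is a minimal sketch of what MyMapper and MyReducer could look like with the old mapred API, consistent with the key/value types set in Experiment. The "count the first token of each line" logic is purely hypothetical, standing in for whatever your job actually computes.

    // MyMapper.java -- hypothetical: emits (first token of the line, 1)
    import java.io.IOException;
    
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;
    
    public class MyMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
    
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
    
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            // Emit the first whitespace-separated token of each line with a count of 1,
            // matching the Text/IntWritable map output types configured in Experiment.
            String[] fields = value.toString().trim().split("\\s+");
            if (fields.length > 0 && !fields[0].isEmpty()) {
                word.set(fields[0]);
                output.collect(word, ONE);
            }
        }
    }
    
    // MyReducer.java -- sums the counts and wraps the result in a BSON document
    import java.io.IOException;
    import java.util.Iterator;
    
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;
    
    import com.mongodb.BasicDBObject;
    import com.mongodb.hadoop.io.BSONWritable;
    
    public class MyReducer extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, BSONWritable> {
    
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, BSONWritable> output, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            // MongoOutputFormat writes each emitted (key, value) pair as a document
            // into the collection named by mongo.output.uri.
            output.collect(key, new BSONWritable(new BasicDBObject("count", sum)));
        }
    }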
    

When you run the Experiment class on your cluster, you pass two parameters. The first parameter is your input path in HDFS, and the second is the MongoDB URI that will hold your results. Here is an example call, assuming Experiment.java is in the package org.example.

sudo -u hdfs hadoop jar ~/jar/myexample.jar org.example.Experiment myfilesinhdfs/* mongodb://192.168.0.1:27017/mydbName.myCollectionName

This might not be the best way but it does the job for me.
