MapReduce

How to restrict the number of concurrently running map tasks?

自古美人都是妖i submitted on 2019-12-23 09:19:47
Question: My Hadoop version is 1.0.2. I want at most 10 map tasks running at the same time. I have found two variables related to this question: a) mapred.job.map.capacity, but in my Hadoop version this parameter seems to have been abandoned; b) mapred.jobtracker.taskScheduler.maxRunningTasksPerJob (http://grepcode.com/file/repo1.maven.org/maven2/com.ning/metrics.collector/1.0.2/mapred-default.xml). I set this variable like below: Configuration conf = new Configuration(); conf.set("date", date); conf.set("mapred
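
The excerpt above is cut off, but a minimal sketch of how such a driver might set the cap looks like this. It assumes the cluster's JobTracker runs a scheduler that actually reads the property (e.g. LimitTasksPerJobTaskScheduler in Hadoop 1.x); the class and job names are made up.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CappedJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The cap is enforced by the JobTracker's scheduler, so this client-side
        // setting only takes effect if the cluster runs a scheduler that reads it
        // (e.g. LimitTasksPerJobTaskScheduler in Hadoop 1.x).
        conf.set("mapred.jobtracker.taskScheduler.maxRunningTasksPerJob", "10");

        Job job = new Job(conf, "capped job");   // Hadoop 1.x style constructor
        job.setJarByClass(CappedJobDriver.class);
        // ... set mapper, reducer, input and output paths as usual ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```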

Use MongoDB aggregation framework to group by length of array

江枫思渺然 submitted on 2019-12-23 08:08:58
Question: I have a collection that looks something like this: { "_id": "id0", "name": "...", "saved_things": [ { ... }, { ... }, { ... }, ] } { "_id": "id1", "name": "...", "saved_things": [ { ... }, ] } { "_id": "id2", "name": "...", "saved_things": [ { ... }, ] } etc... I want to use MongoDB's aggregation framework to come up with a histogram result that tells how many users have a certain count of saved_things. For example, for the dataset above it could return something like: { "_id":
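
The excerpt is truncated, but one common shape for this kind of histogram pipeline is to $project the array length with $size and then $group on that length. A sketch using a recent MongoDB Java driver follows; the database name, collection name, and output field names are assumptions.

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;
import java.util.Arrays;

public class SavedThingsHistogram {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> users =
                    client.getDatabase("test").getCollection("users");   // assumed names

            // Stage 1: project the length of saved_things; stage 2: group by that length
            // and count how many users share it.
            for (Document doc : users.aggregate(Arrays.asList(
                    new Document("$project",
                            new Document("count", new Document("$size", "$saved_things"))),
                    new Document("$group",
                            new Document("_id", "$count")
                                    .append("users", new Document("$sum", 1)))))) {
                System.out.println(doc.toJson());
            }
        }
    }
}
```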

Difference between MapReduce split and Spark partition

做~自己de王妃 submitted on 2019-12-23 08:02:04
Question: I wanted to ask whether there is any significant difference in data partitioning when working with Hadoop/MapReduce and Spark. They both work on HDFS (TextInputFormat), so in theory it should be the same. Are there any cases where the procedure of data partitioning can differ? Any insights would be very helpful to my study. Thanks. Answer 1: Is there any significant difference in data partitioning when working with Hadoop/MapReduce and Spark? Spark supports all Hadoop I/O formats as it uses the same Hadoop
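
As a small illustration of the answer's point, the sketch below (Java Spark, local mode, illustrative input path) loads a text file through Hadoop's TextInputFormat, so the initial partition count follows the InputFormat's splits; unlike MapReduce splits, the partitioning can then be reshaped after loading.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class PartitionsVsSplits {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("partitions-vs-splits").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // textFile goes through Hadoop's TextInputFormat, so the initial number of
            // partitions mirrors the input splits (roughly one per HDFS block).
            JavaRDD<String> lines = sc.textFile("hdfs:///tmp/input.txt");   // path is illustrative
            System.out.println("initial partitions: " + lines.getNumPartitions());

            // Spark partitions can be reshaped after loading, which MapReduce splits cannot.
            System.out.println("after repartition: " + lines.repartition(10).getNumPartitions());
        }
    }
}
```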

Counting bigrams real fast (with or without multiprocessing) - Python

泪湿孤枕 submitted on 2019-12-23 07:47:27
Question: Given big.txt from norvig.com/big.txt, the goal is to count the bigrams really fast (imagine that I have to repeat this counting 100,000 times). According to Fast/Optimize N-gram implementations in python, extracting bigrams like this would be the most optimal: _bigrams = zip(*[text[i:] for i in range(2)]) And if I'm using Python 3, the generator won't be evaluated until I materialize it with list(_bigrams) or some other function that does the same. import io from collections import
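
The question itself is about Python, but the core counting step is language-independent: zip(*[text[i:] for i in range(2)]) simply pairs every element with its successor. A minimal Java sketch of that same pairing over big.txt, shown only to make the idea explicit (file path and tokenisation as characters are assumptions):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public class BigramCount {
    public static void main(String[] args) throws IOException {
        // Read big.txt as one string and pair each character with its successor,
        // the same pairs as zip(text, text[1:]).
        String text = new String(Files.readAllBytes(Paths.get("big.txt")), StandardCharsets.UTF_8);

        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 1 < text.length(); i++) {
            String bigram = text.substring(i, i + 2);
            counts.merge(bigram, 1, Integer::sum);
        }
        System.out.println("distinct bigrams: " + counts.size());
    }
}
```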

Hadoop reduce task running even after specifying -D mapred.reduce.tasks=0 on the command line

你离开我真会死。 submitted on 2019-12-23 05:43:08
Question: I have a MapReduce program as follows: public static class MapClass extends MapReduceBase implements Mapper<Text, Text, IntWritable, IntWritable> { private final static IntWritable uno = new IntWritable(1); private IntWritable citationCount = new IntWritable(); public void map(Text key, Text value, OutputCollector<IntWritable, IntWritable> output, Reporter reporter) throws IOException { citationCount.set(Integer.parseInt(value.toString())); output.collect(citationCount, uno); } } public static class
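
The excerpt stops before the driver, but a frequent cause of this symptom is that generic -D options are only honoured when the driver runs through ToolRunner/GenericOptionsParser. A hedged sketch of an old-API driver that picks up the flag (and also forces a map-only job explicitly); class names and path handling are illustrative, and the mapper line is commented out because MapClass lives in the original code:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class CitationDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // getConf() already contains whatever GenericOptionsParser pulled from the
        // command line, so -D mapred.reduce.tasks=0 is honoured here.
        JobConf conf = new JobConf(getConf(), CitationDriver.class);
        conf.setJobName("citation count");
        // conf.setMapperClass(MapClass.class);   // the mapper from the question
        conf.setNumReduceTasks(0);                // belt and braces: force a map-only job
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        // ToolRunner is what makes the generic -D options work in the first place.
        System.exit(ToolRunner.run(new Configuration(), new CitationDriver(), args));
    }
}
```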

Separate output files in Hadoop MapReduce

社会主义新天地 submitted on 2019-12-23 05:23:12
Question: My question has probably already been asked, but I cannot find a clear answer. My MapReduce is a basic WordCount. My current output file is: // filename : 'part-r-00000' 789 a 755 #c 456 d 123 #b How can I change the output filename? Also, is it possible to have 2 output files: // First output file 789 a 456 d // Second output file 123 #b 755 #c Here's my reduce class: public static class SortReducer extends Reducer<IntWritable, Text, IntWritable, Text> { public void reduce
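
A common way to control output file names and split the output is MultipleOutputs together with LazyOutputFormat. The sketch below adapts the reducer from the excerpt; the named outputs "words" and "hashtags" and the '#'-prefix routing rule are assumptions, not the asker's code.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class SplitOutputs {

    public static class SortReducer extends Reducer<IntWritable, Text, IntWritable, Text> {
        private MultipleOutputs<IntWritable, Text> mos;

        @Override
        protected void setup(Context context) {
            mos = new MultipleOutputs<>(context);
        }

        @Override
        protected void reduce(IntWritable key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            for (Text value : values) {
                // Route '#' words and plain words to differently named files.
                String name = value.toString().startsWith("#") ? "hashtags" : "words";
                mos.write(name, key, value, name);   // last arg becomes the file prefix, e.g. hashtags-r-00000
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            mos.close();
        }
    }

    // Driver side: declare the named outputs and suppress the empty default part-r-* files.
    static void configure(Job job) {
        MultipleOutputs.addNamedOutput(job, "words", TextOutputFormat.class, IntWritable.class, Text.class);
        MultipleOutputs.addNamedOutput(job, "hashtags", TextOutputFormat.class, IntWritable.class, Text.class);
        LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
    }
}
```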

Java Spark: multiply all rows for a column

这一生的挚爱 submitted on 2019-12-23 04:53:11
Question: For this given input data:

|BASE_CAP_RET|BASE_INC_RET|BASE_TOT_RET|acct_cd|eff_date           |id      |
+------------+------------+------------+-------+-------------------+--------+
|0.1         |0.2         |0.1         |acc1   |2004-01-01T00:00:00|10018069|
|0.2         |0.2         |0.1         |acc1   |2004-01-01T00:00:00|10018069|
|0.3         |0.2         |0.1         |acc1   |2004-01-02T00:00:00|10018069|

How do I multiply all rows for the columns BASE_CAP_RET, BASE_INC_RET and BASE_TOT_RET? |BASE_CAP_RET|BASE_INC_RET|BASE_TOT_RET|acct_cd|eff_date|id | +------------+----------
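
Spark has no built-in product aggregate, so one straightforward approach (not necessarily the asker's) is to fold each column with an RDD reduce. A sketch in Java Spark, with an illustrative CSV input path:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ColumnProduct {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("column-product").master("local[*]").getOrCreate();

        Dataset<Row> df = spark.read()
                .option("header", "true").option("inferSchema", "true")
                .csv("returns.csv");                                   // illustrative input path

        // Multiply every value in each of the three columns into a single product.
        for (String col : new String[]{"BASE_CAP_RET", "BASE_INC_RET", "BASE_TOT_RET"}) {
            double product = df.select(col).javaRDD()
                    .map(r -> Double.parseDouble(r.get(0).toString()))
                    .reduce((a, b) -> a * b);
            System.out.println(col + " product = " + product);
        }
        spark.stop();
    }
}
```

When all values are strictly positive, an equivalent SQL-style trick is exp(sum(log(col))), which keeps the work inside a single aggregation instead of an RDD reduce.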

Event Notification of Data Availability in HDFS?

﹥>﹥吖頭↗ submitted on 2019-12-23 04:52:27
Question: What would be the best approach to implementing a notification system for Hadoop data availability, such that whenever new data arrives it creates a notification which can be used by a job-control framework to start the jobs that depend on that data? The main concern is that the job should be triggered as soon as the data becomes available, instead of the job polling the NameNode for availability of the data. Answer 1: What I would do is use a producer/consumer model that can interact with each
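
The answer is cut off, but the producer/consumer idea it starts to describe can be sketched in a few lines: whatever lands the data announces the path on a queue, and a consumer blocks on the queue and hands the path to the job-control framework, so nothing polls the NameNode. This is only an in-JVM illustration; in practice the queue would be something external such as a message broker.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class DataAvailabilityNotifier {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> readyPaths = new LinkedBlockingQueue<>();

        // Consumer: blocks until the producer announces a path, then triggers the job.
        Thread consumer = new Thread(() -> {
            try {
                while (true) {
                    String path = readyPaths.take();     // no polling of the NameNode
                    System.out.println("launching job for " + path);
                    // here the path would be handed to the job-control framework
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        consumer.setDaemon(true);
        consumer.start();

        // Producer: whatever writes the data into HDFS announces it once the write is
        // complete, e.g. after renaming a temporary directory to its final name.
        readyPaths.put("/data/incoming/2019-12-23/part-00000");
        Thread.sleep(1000);
    }
}
```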

“java.io.IOException: Pass a Delete or a Put” when reading HDFS and storing HBase

不羁的心 submitted on 2019-12-23 04:20:13
Question: I have been going crazy with this error for a week. There was a post with the same problem, Pass a Delete or a Put error in hbase mapreduce, but that resolution does not really work for me. My Driver: Configuration conf = HBaseConfiguration.create(); Job job; try { job = new Job(conf, "Training"); job.setJarByClass(TrainingDriver.class); job.setMapperClass(TrainingMapper.class); job.setMapOutputKeyClass(LongWritable.class); job.setMapOutputValueClass(Text.class); FileInputFormat.setInputPaths(job, new
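
The excerpt is cut off, but the linked post points at the usual cause of this exception: when the job writes to HBase through TableOutputFormat, the reducer must emit Put (or Delete) objects rather than plain Writables. A hedged sketch of what the reducer and the driver wiring typically look like; the table name, column family, qualifier and row-key layout are all made up, and Put.addColumn assumes HBase 1.x or later (older versions use Put.add).

```java
import java.io.IOException;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class TrainingReducer extends TableReducer<LongWritable, Text, ImmutableBytesWritable> {

    @Override
    protected void reduce(LongWritable key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            Put put = new Put(Bytes.toBytes(key.get()));                  // row key from the map output key
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("val"),      // made-up family/qualifier
                    Bytes.toBytes(value.toString()));
            context.write(new ImmutableBytesWritable(put.getRow()), put); // emit a Put, not a Text
        }
    }

    // Driver side: wires the reducer to TableOutputFormat for the target table.
    static void configure(Job job) throws IOException {
        TableMapReduceUtil.initTableReducerJob("training", TrainingReducer.class, job);
    }
}
```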

Running MongoDB Queries in Map/Reduce

醉酒当歌 submitted on 2019-12-23 04:04:47
Question: Is it possible to run MongoDB commands, like a query to grab additional data or an update, from within MongoDB's MapReduce command, either in the Map or the Reduce function? Is this completely ludicrous to do anyway? Currently I have some documents that refer to separate collections using the MongoDB DBReference command. Thanks for the help! Answer 1: Is it possible to run MongoDB commands... from within MongoDB's MapReduce command. In theory, this is possible. In practice there are lots of
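
The answer is truncated, but a common way to sidestep querying from inside map/reduce is to write the map-reduce results to an output collection and resolve the DBRef links client-side afterwards. A sketch with the MongoDB Java driver; the database, collection, and field names are assumptions.

```java
import com.mongodb.DBRef;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import org.bson.Document;

public class ResolveRefsAfterMapReduce {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoDatabase db = client.getDatabase("test");                 // illustrative names
            MongoCollection<Document> mrOutput = db.getCollection("mr_results");

            for (Document doc : mrOutput.find()) {
                Object ref = doc.get("otherDoc");                          // assumed DBRef field
                if (ref instanceof DBRef) {
                    DBRef dbRef = (DBRef) ref;
                    // Resolve the referenced document in application code, not inside map/reduce.
                    Document linked = db.getCollection(dbRef.getCollectionName())
                            .find(new Document("_id", dbRef.getId())).first();
                    System.out.println(doc.toJson() + " -> "
                            + (linked == null ? "missing" : linked.toJson()));
                }
            }
        }
    }
}
```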