MapReduce

Hadoop strange behaviour: reduce function doesn't get all values for a key

偶尔善良 submitted on 2019-12-11 20:20:02
Question: In my Hadoop project, I am reading lines of a text file with a number of names on each line. The first name is my username, and the rest are a list of friends. In the map function I then create (username, friend) pairs; each pair has a key "Key[name1][name2]", where name1 and name2 are the username and the friend name ordered alphabetically. Normally, after reading the line for userA and the line for userB, if they both have each other in their friends list, I would get 2 identical keys
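
A minimal Hadoop Streaming-style sketch of the key construction described above (the whitespace-separated input layout is an assumption for illustration, not taken from the asker's code):

    import sys

    # Assumed input: one line per user, "<username> <friend1> <friend2> ...".
    for line in sys.stdin:
        names = line.split()
        if not names:
            continue
        username, friends = names[0], names[1:]
        for friend in friends:
            # Order the two names alphabetically so that userA's line and
            # userB's line emit the same key when they list each other.
            name1, name2 = sorted([username, friend])
            print('Key[%s][%s]\t%s' % (name1, name2, username))

Sorting the two names before building the key is what makes the reducer see both sides of a mutual friendship under one key, which is exactly why the question expects two identical keys.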

The detailed execution phases of MapReduce

南楼画角 submitted on 2019-12-11 20:16:49
Steps 1-4: The map task reads the file; TextInputFormat reads the text split one line at a time and returns a (key, value) pair.
Step 5: Each (key, value) pair from the previous step is transformed by the map method's logic into a new (key, value) pair, which is output via context.write to the OutputCollector.

Shuffle phase:

Step 6: The OutputCollector writes the collected (key, value) pairs into a circular in-memory buffer; the buffer defaults to 100 MB, and a spill is triggered when it is 80% full.
Step 7: Before spilling, the data is partitioned and sorted: each (key, value) pair in the buffer is hashed to a partition value, pairs with the same partition value belong to the same partition, and within a partition the pairs are sorted by key using quicksort (the default partitioning rule is sketched just after this list).
Step 8: The sorted buffer contents are repeatedly spilled to local disk; if the map phase processes a lot of data, multiple spill files may be written (80 MB per spill; a block defaults to 128 MB, so normally two spill files, but a logical split can exceed 128 MB and cause more than two).
Step 9: The multiple spill files are merged into one large file (merge sort); the map task's final result file is thus partitioned and sorted within each partition.
Step 10: Each reduce task, according to its own partition number, copies the data for that partition from every map task node to the reduce task's local working directory.
Step 11: The reduce task takes the same-partition result files coming from the different map tasks
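
The default partitioning rule in step 7 can be sketched in a few lines (illustrative only: Hadoop's HashPartitioner uses Java's String.hashCode, whereas this uses Python's built-in hash, so the concrete values differ):

    # Mirrors Hadoop's HashPartitioner logic:
    # partition = (hash(key) & Integer.MAX_VALUE) % numReduceTasks
    def partition_for(key, num_reduce_tasks):
        return (hash(key) & 0x7FFFFFFF) % num_reduce_tasks

    # Every occurrence of the same key lands in the same partition within a
    # run, so a single reduce task receives all values for that key.
    for k in ["apple", "banana", "apple"]:
        print(k, partition_for(k, 3))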

Finding the hostname of slave nodes in Hadoop during execution of a running MapReduce job

陌路散爱 submitted on 2019-12-11 20:00:21
Question: I want to know how to execute MapReduce code on a Hadoop 2.9.0 multi-node cluster. I want to understand which node processes which input; in other words, how can I find out which mapper processed each part of the input data? I executed the following Python code on the master:

    import sys
    import socket

    for line in sys.stdin:
        line = line.strip()
        words = line.split()
        for word in words:
            print('%s\t%s\t%s' % (word, 1, socket.gethostname()))

I used socket.gethostname() to find the hostname of the nodes. I expected the output of
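
A matching streaming reducer could then aggregate those lines per word and node. A minimal sketch, assuming the mapper's tab-separated "word, count, hostname" format above (collecting into a dict avoids relying on any ordering of hostnames within a word's group):

    import sys
    from collections import defaultdict

    # Input lines: "word \t count \t hostname" (the mapper's format above).
    counts = defaultdict(int)
    for line in sys.stdin:
        word, count, host = line.rstrip('\n').split('\t')
        counts[(word, host)] += int(count)

    # Output: how many times each word was processed on each node.
    for (word, host), total in sorted(counts.items()):
        print('%s\t%s\t%d' % (word, host, total))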

Hadoop MapReduce NoSuchElementException

安稳与你 submitted on 2019-12-11 19:54:07
Question: I wanted to run a MapReduce job on my FreeBSD cluster with two nodes, but I get the following exception:

    14/08/27 14:23:04 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    14/08/27 14:23:04 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
    14/08/27 14:23:04 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
    14/08/27 14:23:04 WARN

Hadoop/MR temporary directory

邮差的信 submitted on 2019-12-11 19:43:30
Question: I've been struggling to get Hadoop and Map/Reduce to use a separate temporary directory instead of /tmp on my root partition. I've added the following to my core-site.xml config file:

    <property>
      <name>hadoop.tmp.dir</name>
      <value>/data/tmp</value>
    </property>

I've added the following to my mapreduce-site.xml config file:

    <property>
      <name>mapreduce.cluster.local.dir</name>
      <value>${hadoop.tmp.dir}/mapred/local</value>
    </property>
    <property>
      <name>mapreduce.jobtracker.system
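
One quick way to confirm the override actually took effect is to ask Hadoop which value it resolves for the key. A small sketch, assuming the hdfs CLI is on the PATH:

    import subprocess

    # `hdfs getconf -confKey <key>` prints the value Hadoop resolves for a
    # configuration key, so this shows whether core-site.xml is picked up.
    value = subprocess.check_output(
        ["hdfs", "getconf", "-confKey", "hadoop.tmp.dir"], text=True
    ).strip()
    print("hadoop.tmp.dir =", value)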

hadoop-streaming: reduce task in pending state says “No room for reduce task.”

孤街醉人 submitted on 2019-12-11 19:30:06
Question: My map tasks complete successfully and I can see the application logs, but the reducer stays in the pending state:

    Kind     % Complete   Num Tasks   Pending   Running   Complete   Killed   Failed/Killed Task Attempts
    map      100.00%      200         0         0         200        0        0 / 40
    reduce   0.00%        1           1         0         0          0        0 / 0

When I look at the reduce task, I see "All Task Attempts: No Task Attempts found". When I look at hadoop-hduser-jobtracker-master.log, I see the following:

    2011-10-31 00:00:00,238 WARN org.apache.hadoop.mapred.JobInProgress: No room for reduce task.
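
That warning means the JobTracker is declining to schedule the reducer because no node's local disk can hold the estimated reduce input. A rough local sanity check of the same condition (the directory path and size estimate below are placeholders, not values from the question):

    import shutil

    local_dir = "/tmp/hadoop/mapred/local"  # placeholder: your mapred.local.dir
    estimated_reduce_input = 50 * 1024**3   # placeholder: total map output bytes

    # The scheduler compares free space on the node's local dir against the
    # expected reduce input; if it falls short, the reduce task stays pending.
    free = shutil.disk_usage(local_dir).free
    print("free bytes:", free)
    if free < estimated_reduce_input:
        print("not enough room for the reduce task on this node")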

How would you use map reduce on this document structure?

こ雲淡風輕ζ submitted on 2019-12-11 19:26:14
Question: If I wanted to count foobar.relationships.friend.count, how would I use map/reduce against this document structure so that the count equals 22?

    [
      [0] {
        "rank" => nil,
        "profile_id" => 3,
        "20130913" => {
          "foobar" => {
            "relationships" => {
              "acquaintance" => { "count" => 0 },
              "friend" => {
                "males_count" => 0,
                "ids" => [],
                "females_count" => 0,
                "count" => 10
              }
            }
          }
        },
        "20130912" => {
          "foobar" => {
            "relationships" => {
              "acquaintance" => { "count" => 0 },
              "friend" => {
                "males_count" => 0,
                "ids" => [
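
Outside of any particular map/reduce engine, the aggregation itself is just a walk over the per-date subdocuments. A plain-Python sketch of that logic (the document above is truncated, so the second date's count of 12 is an assumption chosen to make the total match the question's 22):

    # Sum foobar.relationships.friend.count across every per-date entry.
    def friend_count(doc):
        total = 0
        for value in doc.values():
            if not isinstance(value, dict):
                continue  # skips scalar fields such as rank and profile_id
            friend = (value.get("foobar", {})
                           .get("relationships", {})
                           .get("friend", {}))
            total += friend.get("count", 0)
        return total

    doc = {
        "rank": None,
        "profile_id": 3,
        "20130913": {"foobar": {"relationships": {"friend": {"count": 10}}}},
        "20130912": {"foobar": {"relationships": {"friend": {"count": 12}}}},
    }
    print(friend_count(doc))  # 22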

Can't load Avro schema in Pig

末鹿安然 submitted on 2019-12-11 19:08:15
Question: I have an Avro schema, and I am writing data with that schema using AvroSequenceFileOutputFormat. I looked in the file and can confirm that the schema is there to read. I call

    avro = load 'part-r-00000.avro' using AvroStorage();

and it gives me the error message:

    ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2245: Cannot get schema from loadFunc org.apache.pig.builtin.AvroStorage
    Details at logfile: /Users/ajosephs/Code/serialization-protocol/output/pig_1391635368675.log

Does
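
Since AvroStorage reads Avro object container files while AvroSequenceFileOutputFormat writes Hadoop SequenceFiles, checking the file's magic bytes is a quick way to see what Pig is actually being handed. A small sketch using the path from the question:

    # Avro object container files begin with the 4 bytes "Obj\x01";
    # Hadoop SequenceFiles begin with "SEQ". AvroStorage can only read
    # the former, which would explain the "Cannot get schema" error.
    with open("part-r-00000.avro", "rb") as f:
        magic = f.read(4)

    if magic == b"Obj\x01":
        print("Avro object container file")
    elif magic[:3] == b"SEQ":
        print("Hadoop SequenceFile (not readable by AvroStorage)")
    else:
        print("unrecognized format:", magic)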

MongoDB MapReduce, second argument of reduce function is multidimensional array

为君一笑 submitted on 2019-12-11 18:57:02
Question: I tried to use mapReduce on my collection. Just for debugging, I returned the vals value passed as the second argument to the reduce function, like this:

    db.runCommand({
      "mapreduce": "MyCollection",
      "map": function() {
        emit(
          {
            country_code: this.cc,
            partner: this.di,
            registeredPeriod: Math.floor((this.ca - 1399240800) / 604800)
          },
          { count: Math.ceil((this.lla - this.ca) / 86400) }
        );
      },
      "reduce": function(k, vals) {
        return { 'count': vals };
      },
      "query": {
        "ca": { "$gte": 1399240800 },
        "di": 405,
        "cc": "1"
      },
      "out": { "inline"
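
The shape of that returned value matters: MongoDB may feed reduce's own output back into reduce, so reduce must return a value with the same shape as the emitted ones, and returning the raw vals array is what produces the multidimensional nesting. A pymongo sketch of a shape-preserving reduce (assuming a local mongod on a version that still supports the deprecated mapReduce command; the collection and field names follow the question):

    from pymongo import MongoClient
    from bson.code import Code

    db = MongoClient().test

    mapper = Code("""
        function() {
            emit({ country_code: this.cc, partner: this.di },
                 { count: Math.ceil((this.lla - this.ca) / 86400) });
        }
    """)

    # Returns the same { count: <number> } shape that map emits, so
    # re-reducing partial results stays correct.
    reducer = Code("""
        function(key, vals) {
            var total = 0;
            vals.forEach(function(v) { total += v.count; });
            return { count: total };
        }
    """)

    result = db.command("mapReduce", "MyCollection",
                        map=mapper, reduce=reducer, out={"inline": 1})
    print(result["results"])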

MapReduce output files: part-r-* and part-*

喜夏-厌秋 submitted on 2019-12-11 18:43:23
Question: I have some questions about MapReduce output part files. 1> What is the difference between part-r-* files and part-* files in MapReduce output? Is part-r-* the output from the mapper and part-* the output from the reducer? 2> If the reducer doesn't produce any results, will the mapper output be kept or deleted?

Answer 1: Normally, part-r-* comes from the reducer. MultipleOutputs allows you to use a different naming convention. If there is no reduce step, the output will be part-m-*. As I understand it, if
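
A tiny sketch of the naming convention the answer describes, classifying the part files in a job's output directory ("output" is a placeholder path):

    import glob
    import os

    # part-m-* files come from map-only jobs; part-r-* files from reducers.
    for path in sorted(glob.glob(os.path.join("output", "part-*"))):
        name = os.path.basename(path)
        if name.startswith("part-m-"):
            kind = "map output"
        elif name.startswith("part-r-"):
            kind = "reduce output"
        else:
            kind = "other (e.g. a MultipleOutputs naming convention)"
        print(name, "->", kind)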