hadoop-streaming

'./manage.py runserver' restarts when celery map/reduce tasks are running; sometimes raises error with inner_run

Submitted by 扶醉桌前 on 2019-12-11 08:59:53
Question: I have a view in my Django project that fires off a Celery task. The Celery task itself triggers a few map/reduce jobs via subprocess/Fabric, and the results of the Hadoop job are stored on disk; nothing is actually stored in the database. After the Hadoop job has completed, the Celery task sends a Django signal that it is done, something like this:

# tasks.py
from models import MyModel
import signals
from fabric.operations import local
from celery.task import Task

class
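The excerpt stops at the class definition. As a rough sketch of the pattern it describes (the task name, the signal hadoop_job_done, and the streaming command below are placeholders, not the poster's actual code):

# minimal sketch of a Celery task that shells out to a streaming job and then
# notifies the Django app via a signal; all names here are illustrative
from celery.task import Task
from fabric.operations import local
import signals  # the project's own signals module


class RunHadoopJob(Task):
    def run(self, input_path, output_path):
        # launch the map/reduce job in a subprocess; results stay on disk
        local("hadoop jar hadoop-streaming.jar "
              "-input %s -output %s "
              "-mapper mapper.py -reducer reducer.py" % (input_path, output_path))
        # tell the rest of the Django app that the job has finished
        signals.hadoop_job_done.send(sender=self.__class__, output_path=output_path)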

hadoop-streaming.jar adds x'09' at the end of each line

Submitted by 偶尔善良 on 2019-12-11 06:35:15
Question: I am trying to merge some *_0 part files in an HDFS location using the hadoop-streaming.jar command below.

hadoop jar $HDPHOME/hadoop-streaming.jar -Dmapred.reduce.tasks=1 -input $INDIR -output $OUTTMP/${OUTFILE} -mapper cat -reducer cat

This works, except that the result of the above command seems to add x'09' (a tab) to the end of each line. We have Hive tables defined on top of the part files (which are replaced with the merged file) where the last
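The usual explanation for that trailing tab is that streaming's text output writes key, separator, value; with a cat reducer the whole line becomes the key and the value is empty, so the tab separator ends up dangling at the end of the line. One common workaround (a sketch, not taken from the post) is to replace the cat reducer with a tiny script that strips a single trailing tab:

#!/usr/bin/env python
# strip_tab.py: pass every line through, dropping one trailing tab if present
import sys

for line in sys.stdin:
    line = line.rstrip('\n')
    if line.endswith('\t'):
        line = line[:-1]
    sys.stdout.write(line + '\n')

The merge would then run with -reducer strip_tab.py -file strip_tab.py instead of -reducer cat.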

Hadoop Buffering vs Streaming

Submitted by 元气小坏坏 on 2019-12-11 03:11:32
Question: Could someone please explain the difference between Hadoop streaming and buffering? Here is the context I have read in Hive: In every map/reduce stage of the join, the last table in the sequence is streamed through the reducers whereas the others are buffered. Therefore, it helps to reduce the memory needed in the reducer for buffering the rows for a particular value of the join key by organizing the tables such that the largest tables appear last in the sequence. e.g. in:
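To make the buffered-versus-streamed distinction concrete, here is a small Python sketch of what a reduce-side join conceptually does for one join key (illustrative only, not Hive internals): rows from every table except the last are held in memory, while rows of the last table are joined as they stream past.

# conceptual reduce-side join for a single join key; 'last' marks the streamed table
def join_one_key(tagged_rows):
    buffered = []                      # rows from every table except the last
    for tag, row in tagged_rows:       # input is sorted so 'last' rows arrive last
        if tag != 'last':
            buffered.append(row)       # memory cost grows with these tables
        else:
            for b in buffered:         # the last table is never held in memory
                yield b + row

pairs = [('t1', ('a',)), ('t1', ('b',)), ('last', (1,)), ('last', (2,))]
print(list(join_one_key(pairs)))       # [('a', 1), ('b', 1), ('a', 2), ('b', 2)]

This is why the Hive documentation quoted above advises putting the largest table last: only the earlier tables pay the buffering cost.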

Hadoop global variable with streaming

Submitted by 牧云@^-^@ on 2019-12-11 02:58:47
Question: I understand that I can pass some global value to my mappers via the Job and the Configuration. But how can I do that using Hadoop Streaming (Python in my case)? What is the right way?

Answer 1: Based on the docs, you can specify a command line option (-cmdenv name=value) to set environment variables on each distributed machine, which you can then use in your mappers/reducers:

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
  -input input.txt \
  -output output.txt \
  -mapper mapper.py \
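On the Python side, the value passed with -cmdenv is then read from the process environment. A minimal mapper sketch follows; the variable name MY_THRESHOLD and the word/count input format are made up for illustration.

#!/usr/bin/env python
# mapper.py (sketch): read a "global" value set by the job with
#   -cmdenv MY_THRESHOLD=5
import os
import sys

threshold = int(os.environ.get('MY_THRESHOLD', '0'))

for line in sys.stdin:
    word, count = line.split()
    if int(count) >= threshold:
        sys.stdout.write('%s\t%s\n' % (word, count))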

AWS Elastic mapreduce doesn't seem to be correctly converting the streaming to jar

Submitted by 扶醉桌前 on 2019-12-11 02:38:13
Question: I have a mapper and reducer that work fine when I run them as a local pipeline:

cat data.csv | ./mapper.py | sort -k1,1 | ./reducer.py

I used the Elastic MapReduce wizard and loaded the inputs, outputs, bootstrap, etc. The bootstrap succeeds, but I am still getting an error during execution. This is the error I'm getting in my stderr for step 1...

+ /etc/init.d/hadoop-state-pusher-control stop
+ PID_FILE=/mnt/var/run/hadoop-state-pusher/hadoop-state-pusher.pid
+ LOG_FILE=/mnt/var/log/hadoop-state

Pass environment variables to Hive Transform or MapReduce

Submitted by 房东的猫 on 2019-12-11 00:07:45
Question: I am trying to pass a custom environment variable to an executable (my-mapper.script in the example below) used in a Hive TRANSFORM, e.g.:

SELECT TRANSFORM(x, y, z) USING 'my-mapper.script' FROM ( SELECT x, y, z FROM table )

I know that in Hadoop streaming this can be achieved using -cmdenv EXAMPLE_DIR=/home/example/dictionaries/, but I do not know how to do this in a Hive Transform/MapReduce. Any ideas?

Answer 1: You can wrap your script with a simple two-line bash script to set up the environment, e.g. #!
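The bash wrapper in the answer is cut off above. The same wrapping idea can be sketched in Python (an illustration of the technique, not the answer's actual wrapper): export the variable, then hand control, stdin, and stdout over to the real mapper.

#!/usr/bin/env python
# wrapper.py (sketch): set the variable, then replace this process with the real
# mapper so stdin/stdout and the modified environment are inherited
import os
import sys

os.environ['EXAMPLE_DIR'] = '/home/example/dictionaries/'
os.execvp('my-mapper.script', ['my-mapper.script'] + sys.argv[1:])

The TRANSFORM clause would then reference the wrapper instead of my-mapper.script.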

Hadoop streaming with python on Windows

Submitted by 巧了我就是萌 on 2019-12-10 19:05:48
Question: I'm using Hortonworks HDP for Windows and have it successfully configured with a master and two slaves. I'm using the following command:

bin\hadoop jar contrib\streaming\hadoop-streaming-1.1.0-SNAPSHOT.jar -files file:///d:/dev/python/mapper.py,file:///d:/dev/python/reducer.py -mapper "python mapper.py" -reducer "python reduce.py" -input /flume/0424/userlog.MDAC-HD1.MDAC.local..20130424.1366789040945 -output /flume/o%1 -cmdenv PYTHONPATH=c:\python27

The mapper runs through fine, but the log

hadoop, python, subprocess failed with code 127

Submitted by 谁说胖子不能爱 on 2019-12-10 15:18:04
Question: I'm trying to run a very simple task with MapReduce.

mapper.py:

#!/usr/bin/env python
import sys
for line in sys.stdin:
    print line

My txt file: qwerty asdfgh zxc

Command line to run the job:

hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.6.0-mr1-cdh5.8.0.jar \
  -input /user/cloudera/In/test.txt \
  -output /user/cloudera/test \
  -mapper /home/cloudera/Documents/map.py \
  -file /home/cloudera/Documents/map.py

Error:

INFO mapreduce.Job: Task Id : attempt_1490617885665

Sorting by value in Hadoop from a file

Submitted by 邮差的信 on 2019-12-10 13:45:06
Question: I have a file containing a string, then a space, and then a number on every line. Example:

Line1: Word 2
Line2: Word1 8
Line3: Word2 1

I need to sort by the number in descending order and then put the result in a file, assigning a rank to the numbers. So my output should be a file in the following format:

Line1: Word1 8 1
Line2: Word 2 2
Line3: Word2 1 3

Does anyone have an idea how I can do this in Hadoop? I am using Java with Hadoop.

Answer 1: You could organize your map/reduce computation
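The answer is cut off above. As a rough illustration of how such a ranking can be expressed in streaming terms (a sketch, not the truncated answer's approach, and the question itself uses Java), one option is to push everything through a single reducer that sorts and ranks in memory, which only works when the data fits on one node.

#!/usr/bin/env python
# rank_reducer.py (sketch): run with a single reducer; collects all (word, count)
# pairs, sorts by count descending, and appends a 1-based rank
import sys

pairs = []
for line in sys.stdin:
    word, count = line.split()
    pairs.append((word, int(count)))

pairs.sort(key=lambda p: p[1], reverse=True)

for rank, (word, count) in enumerate(pairs, 1):
    sys.stdout.write('%s %d %d\n' % (word, count, rank))

On the example input this prints Word1 8 1, Word 2 2, and Word2 1 3, matching the expected output.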

hadoop-streaming: reducer in pending state, doesn't start?

Submitted by 半城伤御伤魂 on 2019-12-10 12:08:40
Question: I have a map/reduce job which was running fine until I started to see some failed map tasks like:

attempt_201110302152_0003_m_000010_0   task_201110302152_0003_m_000010   worker1   FAILED
Task attempt_201110302152_0003_m_000010_0 failed to report status for 602 seconds. Killing!
-------
Task attempt_201110302152_0003_m_000010_0 failed to report status for 607 seconds. Killing!

attempt_201110302152_0003_m_000010_1   task_201110302152_0003_m_000010   master   FAILED
java.lang
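About the "failed to report status" messages: Hadoop streaming lets a mapper or reducer report liveness by writing special reporter lines to stderr, which is the usual way to keep long-running per-record work from being killed by the task timeout. A minimal sketch follows; slow_work is a stand-in for whatever the real task does.

#!/usr/bin/env python
# heartbeat sketch: periodically emit reporter:status lines on stderr so the
# framework knows the task is still alive
import sys
import time

def slow_work(line):
    time.sleep(1)              # stand-in for an expensive per-record computation
    return line.strip()

for n, line in enumerate(sys.stdin):
    if n % 100 == 0:
        sys.stderr.write('reporter:status:processed %d records\n' % n)
        sys.stderr.flush()
    sys.stdout.write(slow_work(line) + '\n')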