hadoop-streaming

Python Hadoop Streaming Error “ERROR streaming.StreamJob: Job not Successful!” and Stack trace: ExitCodeException exitCode=134

Submitted by 有些话、适合烂在心里 on 2019-11-29 15:15:55
Question: I am trying to run a Python script on a Hadoop cluster, using Hadoop Streaming, for sentiment analysis. The same script runs properly on my local machine and produces output. To run it on the local machine I use this command:

    $ cat /home/MB/analytics/Data/input/* | ./new_mapper.py

and to run it on the Hadoop cluster I use the command below:

    $ hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.5.0-mr1-cdh5.2.0.jar -mapper "python $PWD/new_mapper.py" -reducer "$PWD/new
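
Exit code 134 is 128 + 6, i.e. the mapper or reducer process died on SIGABRT rather than failing with an ordinary script error. A common first thing to rule out is that $PWD/new_mapper.py only exists on the submitting machine: streaming needs the script shipped to the workers with -file. Below is a hypothetical skeleton of new_mapper.py (the real script is not shown in the question) that also routes errors to stderr, where the task attempt logs will capture them:

    #!/usr/bin/env python
    # Hypothetical skeleton of new_mapper.py. Ship the script with the job so
    # worker nodes can find it, e.g.:
    #   hadoop jar .../hadoop-streaming-2.5.0-mr1-cdh5.2.0.jar \
    #       -file new_mapper.py -mapper new_mapper.py \
    #       -input <hdfs-input> -output <hdfs-output>
    import sys

    def main():
        for line in sys.stdin:
            try:
                for word in line.strip().split():   # placeholder tokenization
                    print("%s\t%s" % (word, 1))     # key<TAB>value for streaming
            except Exception as exc:
                # stderr ends up in the task attempt's log, unlike stdout
                sys.stderr.write("mapper failed on %r: %s\n" % (line, exc))
                raise

    if __name__ == "__main__":
        main()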

Hadoop streaming - remove trailing tab from reducer output

Submitted by 不打扰是莪最后的温柔 on 2019-11-29 08:50:55
Question: I have a Hadoop Streaming job whose output does not contain key/value pairs. You can think of it as value-only pairs or key-only pairs. My streaming reducer (a PHP script) outputs records separated by newlines. Hadoop Streaming treats this as a key with no value and inserts a tab before the newline. This extra tab is unwanted. How do I remove it? I am using Hadoop 1.0.3 on AWS EMR. I downloaded the source of Hadoop 1.0.3 and found this code in hadoop-1.0.3/src/contrib/streaming/src
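
The tab comes from TextOutputFormat, which joins key and value with a configurable separator. One commonly suggested workaround, worth verifying on your Hadoop version, is to set that separator to an empty string at submission time (the property name below is the old-API one read by Hadoop 1.x). A minimal Python stand-in for the PHP reducer, with the flag in a comment:

    #!/usr/bin/env python
    # Value-only streaming reducer (a Python stand-in for the asker's PHP
    # script). The trailing tab is appended by the framework's
    # TextOutputFormat, not by this script. A commonly suggested workaround
    # is to submit the job with the separator set to empty:
    #   hadoop jar .../hadoop-streaming.jar \
    #       -D mapred.textoutputformat.separator= \
    #       ...
    import sys

    for line in sys.stdin:
        # emit each record as-is; no tab is written by the reducer itself
        sys.stdout.write(line.rstrip("\n") + "\n")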

Hadoop cluster - Do I need to replicate my code over all machines before running job?

Submitted by ≯℡__Kan透↙ on 2019-11-28 09:08:09
Question: This is what confuses me: when I use the wordcount example, I keep the code on the master and let it do things with the slaves, and it runs fine. But when I run my own code, it starts to fail on the slaves with weird errors like:

    Traceback (most recent call last):
      File "/app/hadoop/tmp/mapred/local/taskTracker/hduser/jobcache/job_201110250901_0005/attempt_201110250901_0005_m_000001_1/work/./mapper.py", line 55, in <module>
        from src.utilities import utilities
    ImportError: No module named src.utilities
    java
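
You do not have to copy the code to every machine by hand, but anything the mapper imports must travel with the job: the traceback shows the task running from its own scratch directory on the slave, where the src package does not exist. One approach (the zip name and layout here are assumptions for illustration) is to zip the package, ship it with -file, and put the zip on sys.path at the top of the mapper, since Python can import directly from zip archives:

    #!/usr/bin/env python
    # Top of mapper.py: make the zipped 'src' package importable on workers.
    # Assumes the package was zipped and shipped alongside the script, e.g.:
    #   zip -r src.zip src
    #   hadoop jar .../hadoop-streaming.jar -file mapper.py -file src.zip ...
    import os
    import sys

    # everything shipped with -file lands in the task's working directory
    sys.path.insert(0, os.path.join(os.getcwd(), "src.zip"))

    from src.utilities import utilities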

hadoop streaming: how to see application logs?

Submitted by 邮差的信 on 2019-11-28 06:04:36
Question: I can see all the Hadoop logs on my /usr/local/hadoop/logs path, but where can I see application-level logs? For example:

mapper.py

    import logging
    def main():
        logging.info("starting map task now")
        # -- do some task --
        # print statement

reducer.py

    import logging
    def main():
        for line in sys.stdin:
            logging.info("received input to reducer - " + line)
        # -- do some task --
        # print statement

Where can I see the logging.info or related log statements of my application? I am using Python and using hadoop
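
In streaming, stdout is the data channel between the script and the framework, so application logs belong on stderr, which Hadoop captures per task attempt (typically under the userlogs directory, e.g. /usr/local/hadoop/logs/userlogs/<job_id>/<attempt_id>/stderr, and viewable per attempt in the job web UI). A minimal sketch that points the logging module at stderr explicitly:

    #!/usr/bin/env python
    # mapper.py sketch: logs go to stderr, map output goes to stdout.
    import sys
    import logging

    # without an explicit handler, Python 2's logging emits nothing useful;
    # send everything to stderr so the task attempt log captures it
    logging.basicConfig(stream=sys.stderr, level=logging.INFO)

    def main():
        logging.info("starting map task now")
        for line in sys.stdin:
            logging.info("received input line")
            print(line.strip())  # real map output on stdout

    if __name__ == "__main__":
        main()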

How to set the precise max number of concurrently running tasks per node in Hadoop 2.4.0 on Elastic MapReduce

Submitted by 二次信任 on 2019-11-28 03:30:46
Question: According to http://blog.cloudera.com/blog/2014/04/apache-hadoop-yarn-avoiding-6-time-consuming-gotchas/, the formula for determining the number of concurrently running tasks per node is:

    min(yarn.nodemanager.resource.memory-mb / mapreduce.[map|reduce].memory.mb,
        yarn.nodemanager.resource.cpu-vcores / mapreduce.[map|reduce].cpu.vcores)

However, on setting these parameters to (for a cluster of c3.2xlarges):

    yarn.nodemanager.resource.memory-mb = 14336
    mapreduce.map.memory.mb = 2048
    yarn
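
Plugging the question's numbers into that formula gives the expected per-node concurrency. The vcore values below are assumptions, since the question text is cut off: a c3.2xlarge has 8 vCPUs, and mapreduce.map.cpu.vcores defaults to 1:

    memory_mb  = 14336  # yarn.nodemanager.resource.memory-mb
    map_mem_mb = 2048   # mapreduce.map.memory.mb
    vcores     = 8      # yarn.nodemanager.resource.cpu-vcores (assumed)
    map_vcores = 1      # mapreduce.map.cpu.vcores (default)

    concurrent_maps = min(memory_mb // map_mem_mb, vcores // map_vcores)
    print(concurrent_maps)  # 7 -> memory is the binding constraint here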

Running a job using hadoop streaming and mrjob: PipeMapRed.waitOutputThreads(): subprocess failed with code 1

Submitted by 冷暖自知 on 2019-11-27 23:05:40
Question: Hey, I'm fairly new to the world of Big Data. I came across this tutorial at http://musicmachinery.com/2011/09/04/how-to-process-a-million-songs-in-20-minutes/. It describes in detail how to run a MapReduce job using mrjob, both locally and on Elastic MapReduce. Well, I'm trying to run this on my own Hadoop cluster. I ran the job using the following command:

    python density.py tiny.dat -r hadoop --hadoop-bin /usr/bin/hadoop > outputmusic

And this is what I get:

    HADOOP: Running job: job_1369345811890_0245
    HADOOP: Job job_1369345811890_0245 running in uber mode : false
    HADOOP: map 0% reduce 0%
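
"Subprocess failed with code 1" means the Python process Hadoop launched on a worker exited with an error before producing output; the real cause is in that task attempt's stderr log, not in the driver output shown above. Frequent culprits are a missing shebang or Windows line endings on the script, or a Python/mrjob version mismatch on the worker nodes. A useful check is to run a single step by hand the way the framework invokes it; here is a minimal job of the same shape, with hypothetical stand-in logic rather than the tutorial's actual code:

    #!/usr/bin/env python
    # A shebang and Unix line endings matter: if the worker's shell cannot
    # exec the script, streaming reports only "subprocess failed with code 1".
    # Reproduce a step outside Hadoop the way the framework invokes it:
    #   python density.py --step-num=0 --mapper < tiny.dat
    from mrjob.job import MRJob

    class Density(MRJob):
        def mapper(self, _, line):
            # hypothetical stand-in for the tutorial's per-track logic
            yield "tracks", 1

        def reducer(self, key, counts):
            yield key, sum(counts)

    if __name__ == "__main__":
        Density.run()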

Pivot table with Apache Pig

Submitted by 心已入冬 on 2019-11-27 04:55:27
Question: I wonder if it's possible to pivot a table in one pass in Apache Pig.

Input:

    Id Column1 Column2 Column3
    1  Row11   Row12   Row13
    2  Row21   Row22   Row23

Output:

    Id Name    Value
    1  Column1 Row11
    1  Column2 Row12
    1  Column3 Row13
    2  Column1 Row21
    2  Column2 Row22
    2  Column3 Row23

The real data has dozens of columns. I can do this with awk in one pass and then run it with Hadoop Streaming, but the majority of my code is in Apache Pig, so I wonder if it's possible to do it efficiently in Pig.

Answer: You can do it in two ways: 1. Write a UDF which returns a bag of tuples. It will be the most flexible solution, but requires Java code
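
For reference, the one-pass streaming alternative the asker mentions fits in a few lines of Python as a map-only job (the column names are hard-coded here for illustration, since the question implies a known schema):

    #!/usr/bin/env python
    # pivot_mapper.py: emit one (Id, Name, Value) line per input cell
    import sys

    COLUMNS = ["Column1", "Column2", "Column3"]  # dozens in the real data

    for line in sys.stdin:
        fields = line.split()
        if not fields or fields[0] == "Id":  # skip the header row
            continue
        row_id, values = fields[0], fields[1:]
        for name, value in zip(COLUMNS, values):
            print("%s\t%s\t%s" % (row_id, name, value))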
