hadoop-streaming

Python Hadoop Streaming Error “ERROR streaming.StreamJob: Job not Successful!” and Stack trace: ExitCodeException exitCode=134

Submitted by 有些话、适合烂在心里 on 2019-11-29 15:15:55
Question: I am trying to run a Python script on a Hadoop cluster, using Hadoop Streaming, for sentiment analysis. The same script runs properly on my local machine and produces output. To run it on the local machine I use this command:

    $ cat /home/MB/analytics/Data/input/* | ./new_mapper.py

and to run it on the Hadoop cluster I use the command below:

    $ hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.5.0-mr1-cdh5.2.0.jar -mapper "python $PWD/new_mapper.py" -reducer "$PWD/new
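
Exit code 134 is 128 + 6, i.e. the mapper or reducer process died on SIGABRT rather than failing with an ordinary script error. A common first thing to rule out is that $PWD/new_mapper.py only exists on the submitting machine: streaming needs the script shipped to the workers with -file. Below is a hypothetical skeleton of new_mapper.py (the real script is not shown in the question) that also routes errors to stderr, where the task attempt logs will capture them:

    #!/usr/bin/env python
    # Hypothetical skeleton of new_mapper.py. Ship the script with the job so
    # worker nodes can find it, e.g.:
    #   hadoop jar .../hadoop-streaming-2.5.0-mr1-cdh5.2.0.jar \
    #       -file new_mapper.py -mapper new_mapper.py \
    #       -input <hdfs-input> -output <hdfs-output>
    import sys

    def main():
        for line in sys.stdin:
            try:
                for word in line.strip().split():   # placeholder tokenization
                    print("%s\t%s" % (word, 1))     # key<TAB>value for streaming
            except Exception as exc:
                # stderr ends up in the task attempt's log, unlike stdout
                sys.stderr.write("mapper failed on %r: %s\n" % (line, exc))
                raise

    if __name__ == "__main__":
        main()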

Hadoop streaming - remove trailing tab from reducer output

Submitted by 不打扰是莪最后的温柔 on 2019-11-29 08:50:55
Question: I have a Hadoop Streaming job whose output does not contain key/value pairs. You can think of it as value-only pairs or key-only pairs. My streaming reducer (a PHP script) outputs records separated by newlines. Hadoop Streaming treats this as a key with no value and inserts a tab before the newline. This extra tab is unwanted. How do I remove it? I am using Hadoop 1.0.3 on AWS EMR. I downloaded the source of Hadoop 1.0.3 and found this code in hadoop-1.0.3/src/contrib/streaming/src
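
The tab comes from TextOutputFormat, which joins key and value with a configurable separator. One commonly suggested workaround, worth verifying on your Hadoop version, is to set that separator to an empty string at submission time (the property name below is the old-API one read by Hadoop 1.x). A minimal Python stand-in for the PHP reducer, with the flag in a comment:

    #!/usr/bin/env python
    # Value-only streaming reducer (a Python stand-in for the asker's PHP
    # script). The trailing tab is appended by the framework's
    # TextOutputFormat, not by this script. A commonly suggested workaround
    # is to submit the job with the separator set to empty:
    #   hadoop jar .../hadoop-streaming.jar \
    #       -D mapred.textoutputformat.separator= \
    #       ...
    import sys

    for line in sys.stdin:
        # emit each record as-is; no tab is written by the reducer itself
        sys.stdout.write(line.rstrip("\n") + "\n")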

Hadoop cluster - Do I need to replicate my code over all machines before running job?

Submitted by ≯℡__Kan透↙ on 2019-11-28 09:08:09
Question: This is what confuses me: when I use the wordcount example, I keep the code on the master and let it do things with the slaves, and it runs fine. But when I run my own code, it starts to fail on the slaves with weird errors like:

    Traceback (most recent call last):
      File "/app/hadoop/tmp/mapred/local/taskTracker/hduser/jobcache/job_201110250901_0005/attempt_201110250901_0005_m_000001_1/work/./mapper.py", line 55, in <module>
        from src.utilities import utilities
    ImportError: No module named src.utilities
    java
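
You do not have to copy the code to every machine by hand, but anything the mapper imports must travel with the job: the traceback shows the task running from its own scratch directory on the slave, where the src package does not exist. One approach (the zip name and layout here are assumptions for illustration) is to zip the package, ship it with -file, and put the zip on sys.path at the top of the mapper, since Python can import directly from zip archives:

    #!/usr/bin/env python
    # Top of mapper.py: make the zipped 'src' package importable on workers.
    # Assumes the package was zipped and shipped alongside the script, e.g.:
    #   zip -r src.zip src
    #   hadoop jar .../hadoop-streaming.jar -file mapper.py -file src.zip ...
    import os
    import sys

    # everything shipped with -file lands in the task's working directory
    sys.path.insert(0, os.path.join(os.getcwd(), "src.zip"))

    from src.utilities import utilities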

hadoop streaming: how to see application logs?

Submitted by 邮差的信 on 2019-11-28 06:04:36
Question: I can see all the Hadoop logs on my /usr/local/hadoop/logs path, but where can I see application-level logs? For example:

mapper.py

    import logging
    def main():
        logging.info("starting map task now")
        # -- do some task --
        # print statement

reducer.py

    import logging
    def main():
        for line in sys.stdin:
            logging.info("received input to reducer - " + line)
        # -- do some task --
        # print statement

Where can I see the logging.info or related log statements of my application? I am using Python and using hadoop
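
In streaming, stdout is the data channel between the script and the framework, so application logs belong on stderr, which Hadoop captures per task attempt (typically under the userlogs directory, e.g. /usr/local/hadoop/logs/userlogs/<job_id>/<attempt_id>/stderr, and viewable per attempt in the job web UI). A minimal sketch that points the logging module at stderr explicitly:

    #!/usr/bin/env python
    # mapper.py sketch: logs go to stderr, map output goes to stdout.
    import sys
    import logging

    # without an explicit handler, Python 2's logging emits nothing useful;
    # send everything to stderr so the task attempt log captures it
    logging.basicConfig(stream=sys.stderr, level=logging.INFO)

    def main():
        logging.info("starting map task now")
        for line in sys.stdin:
            logging.info("received input line")
            print(line.strip())  # real map output on stdout

    if __name__ == "__main__":
        main()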

How to set the precise max number of concurrently running tasks per node in Hadoop 2.4.0 on Elastic MapReduce

Submitted by 二次信任 on 2019-11-28 03:30:46
Question: According to http://blog.cloudera.com/blog/2014/04/apache-hadoop-yarn-avoiding-6-time-consuming-gotchas/, the formula for determining the number of concurrently running tasks per node is:

    min(yarn.nodemanager.resource.memory-mb / mapreduce.[map|reduce].memory.mb,
        yarn.nodemanager.resource.cpu-vcores / mapreduce.[map|reduce].cpu.vcores)

However, on setting these parameters to (for a cluster of c3.2xlarges):

    yarn.nodemanager.resource.memory-mb = 14336
    mapreduce.map.memory.mb = 2048
    yarn
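
Plugging the question's numbers into that formula gives the expected per-node concurrency. The vcore values below are assumptions, since the question text is cut off: a c3.2xlarge has 8 vCPUs, and mapreduce.map.cpu.vcores defaults to 1:

    memory_mb  = 14336  # yarn.nodemanager.resource.memory-mb
    map_mem_mb = 2048   # mapreduce.map.memory.mb
    vcores     = 8      # yarn.nodemanager.resource.cpu-vcores (assumed)
    map_vcores = 1      # mapreduce.map.cpu.vcores (default)

    concurrent_maps = min(memory_mb // map_mem_mb, vcores // map_vcores)
    print(concurrent_maps)  # 7 -> memory is the binding constraint here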

Running a job using hadoop streaming and mrjob: PipeMapRed.waitOutputThreads(): subprocess failed with code 1

Submitted by 冷暖自知 on 2019-11-27 23:05:40
Question: Hey, I'm fairly new to the world of Big Data. I came across this tutorial at http://musicmachinery.com/2011/09/04/how-to-process-a-million-songs-in-20-minutes/. It describes in detail how to run a MapReduce job using mrjob, both locally and on Elastic MapReduce. Well, I'm trying to run this on my own Hadoop cluster. I ran the job using the following command:

    python density.py tiny.dat -r hadoop --hadoop-bin /usr/bin/hadoop > outputmusic

And this is what I get:

    HADOOP: Running job: job_1369345811890_0245
    HADOOP: Job job_1369345811890_0245 running in uber mode : false
    HADOOP: map 0% reduce 0%
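
"Subprocess failed with code 1" means the Python process Hadoop launched on a worker exited with an error before producing output; the real cause is in that task attempt's stderr log, not in the driver output shown above. Frequent culprits are a missing shebang or Windows line endings on the script, or a Python/mrjob version mismatch on the worker nodes. A useful check is to run a single step by hand the way the framework invokes it; here is a minimal job of the same shape, with hypothetical stand-in logic rather than the tutorial's actual code:

    #!/usr/bin/env python
    # A shebang and Unix line endings matter: if the worker's shell cannot
    # exec the script, streaming reports only "subprocess failed with code 1".
    # Reproduce a step outside Hadoop the way the framework invokes it:
    #   python density.py --step-num=0 --mapper < tiny.dat
    from mrjob.job import MRJob

    class Density(MRJob):
        def mapper(self, _, line):
            # hypothetical stand-in for the tutorial's per-track logic
            yield "tracks", 1

        def reducer(self, key, counts):
            yield key, sum(counts)

    if __name__ == "__main__":
        Density.run()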

Pivot table with Apache Pig

Submitted by 心已入冬 on 2019-11-27 04:55:27
Question: I wonder if it's possible to pivot a table in one pass in Apache Pig.

Input:

    Id Column1 Column2 Column3
    1  Row11   Row12   Row13
    2  Row21   Row22   Row23

Output:

    Id Name    Value
    1  Column1 Row11
    1  Column2 Row12
    1  Column3 Row13
    2  Column1 Row21
    2  Column2 Row22
    2  Column3 Row23

The real data has dozens of columns. I can do this with awk in one pass and then run it with Hadoop Streaming, but the majority of my code is in Apache Pig, so I wonder if it's possible to do it efficiently in Pig.

Answer: You can do it in two ways: 1. Write a UDF which returns a bag of tuples. It will be the most flexible solution, but requires Java code
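
For reference, the one-pass streaming alternative the asker mentions fits in a few lines of Python as a map-only job (the column names are hard-coded here for illustration, since the question implies a known schema):

    #!/usr/bin/env python
    # pivot_mapper.py: emit one (Id, Name, Value) line per input cell
    import sys

    COLUMNS = ["Column1", "Column2", "Column3"]  # dozens in the real data

    for line in sys.stdin:
        fields = line.split()
        if not fields or fields[0] == "Id":  # skip the header row
            continue
        row_id, values = fields[0], fields[1:]
        for name, value in zip(COLUMNS, values):
            print("%s\t%s\t%s" % (row_id, name, value))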
