hadoop-streaming

Permission denied error 13 - Python on Hadoop

落爺英雄遲暮 submitted on 2020-01-15 07:12:31

Question: I am running a simple Python mapper and reducer and am getting a "Permission denied" (errno 13) error. I am not sure what is happening here and need help; I am new to the Hadoop world. It is a simple MapReduce word count, and the mapper and reducer both run fine standalone on Linux or in Windows PowerShell.

hadoop@ubuntu:~/hadoop-1.2.1$ bin/hadoop jar contrib/streaming/hadoop-streaming-1.2.1.jar -file /home/hadoop/mapper.py …
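
A frequent cause of errno 13 with streaming is a mapper script that is not executable on the task nodes or that lacks a shebang line. As an illustrative sketch (not necessarily this poster's exact bug), a minimal word-count mapper that works when shipped with -file and marked executable with chmod +x mapper.py:

    #!/usr/bin/env python
    # mapper.py - minimal word-count mapper for Hadoop Streaming.
    # Streaming feeds input splits on stdin; emit "word<TAB>1" on stdout.
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print("%s\t1" % word)

Without the execute bit or the shebang, the streaming runner cannot exec the script and fails with errno 13; invoking it as -mapper "python mapper.py" sidesteps the execute bit entirely.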

Hadoop: strange ClassNotFoundException

南楼画角 submitted on 2020-01-07 04:07:29

Question: I am getting a ClassNotFoundException. The class that is reportedly missing does not exist; the "class name" in the error is actually the path to the list of input files for my MapReduce job.

INFO server Running: /usr/lib/hadoop/bin/hadoop --config /var/run/cloudera-scm-agent/process/155-hue/JOBSUBD/hadoop-conf jar tmp.jar /user/hduser/datasets/ /user/hduser/tmp/job_20/ mongodb://slave15/db_8.job_20
Exception in thread "main" java.lang.ClassNotFoundException: /user/hduser/datasets/ at java.lang …

(A likely explanation: if tmp.jar's manifest declares no Main-Class, hadoop jar treats the first program argument — here the input path — as the main class name, which would produce exactly this error.)

hadoop streaming produces uncompressed files despite mapred.output.compress=true

本秂侑毒 submitted on 2020-01-06 12:54:51

Question: I run a Hadoop streaming job like this:

hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar -Dmapred.reduce.tasks=16 -Dmapred.output.compres=true -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec -input foo -output bar -mapper "python zot.py" -reducer /bin/cat

I do get 16 files in the output directory, and they contain the correct data, but the files are not compressed:

$ hadoop fs -get bar/part-00012
$ file part-00012
part-00012: ASCII text, …
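
Worth noting: the property as typed in the command reads mapred.output.compres, missing the final "s"; Hadoop silently accepts unknown -D properties, so the misspelled key would simply be ignored and the output left uncompressed. Independent of that, a quick way to confirm whether a retrieved part file came out gzip-compressed is to check for the gzip magic bytes (a small sketch; the filename argument is illustrative):

    #!/usr/bin/env python
    # check_gzip.py - report whether a local file begins with the gzip
    # magic bytes 0x1f 0x8b, i.e. whether output compression took effect.
    import sys

    with open(sys.argv[1], "rb") as f:
        magic = f.read(2)
    print("gzip" if magic == b"\x1f\x8b" else "not gzip")

Usage: python check_gzip.py part-00012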

Hadoop Streaming Job with binary input?

你离开我真会死。 submitted on 2020-01-05 06:53:09

Question: I wish to convert a binary file in one format to a SequenceFile. I have a Python script that takes that format on stdin and can output whatever I want. The input format is not line-based, and the individual records are binary themselves, so the output format cannot be \t-delimited or broken into lines with \n. Can I use the Hadoop Streaming interface to consume a binary format? How do I produce a binary output format? I assume the answer is "No" unless I hear otherwise.

Answer 1: You may consider …
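
The answer is cut off above; one commonly used workaround (an assumption on my part, not necessarily what the answer goes on to suggest) is to make binary records line-safe before they enter the streaming pipeline, e.g. by base64-encoding each record so no stray \t or \n bytes appear. A sketch, assuming records carry a 4-byte big-endian length prefix — the real framing depends on the input format:

    #!/usr/bin/env python
    # encode_records.py - hypothetical pre-processor: turn framed binary
    # records into one base64 line each, which Streaming's line-oriented
    # text protocol can carry without corruption.
    import base64
    import struct
    import sys

    def read_records(stream):
        # Assumed framing: 4-byte big-endian length prefix per record.
        while True:
            header = stream.read(4)
            if len(header) < 4:
                return
            (length,) = struct.unpack(">I", header)
            yield stream.read(length)

    for record in read_records(sys.stdin.buffer):
        sys.stdout.write(base64.b64encode(record).decode("ascii") + "\n")

The mapper then decodes each line with base64.b64decode before processing.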

Hadoop streaming with zip input files

杀马特。学长 韩版系。学妹 submitted on 2020-01-04 05:26:16

Question: I'm trying to run a streaming job where the input files are CSVs inside zip files. I tried using this, but it doesn't seem to work with CDH4 (I get the error "class com.cotdp.hadoop.ZipFileInputFormat not org.apache.hadoop.mapred.InputFormat"). Does anyone know of an input file reader I can use for streaming with zip files? If possible, I'm looking for a multi-file reader (one that can be given the top-level directory).

Answer 1: I ended up writing zipstream. Note that it processes only the first file in …

Hadoop streaming: single file or multi file per map. Don't Split

放肆的年华 submitted on 2020-01-02 10:29:32

Question: I have a lot of zip files that need to be processed by a C++ library, so I use C++ to write my Hadoop streaming program. The program reads a zip file, unzips it, and processes the extracted data. My problem is that my mapper can't get the content of exactly one file; it usually gets something like 2.4 files or 3.2 files. Hadoop will send several files to my mapper, but at least one of the files is partial. Zip files can't be processed like this. Can I get exactly one file per map? I …
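
A workaround often used for whole-file processing with streaming (offered here as a sketch, not as the accepted answer): make the job's input a plain text file listing one HDFS zip path per line. Each map record is then a complete file reference rather than a byte range of the archive, and the mapper fetches and processes the whole archive itself. The process_zip tool below stands in for the C++ program and is hypothetical:

    #!/usr/bin/env python
    # mapper.py - each input line names one HDFS zip file. The mapper
    # copies the whole archive into the task's working directory and
    # hands it to the processing tool, so no archive is ever split.
    import subprocess
    import sys

    for line in sys.stdin:
        hdfs_path = line.strip()
        if not hdfs_path:
            continue
        local_name = hdfs_path.rsplit("/", 1)[-1]
        subprocess.check_call(["hadoop", "fs", "-get", hdfs_path, local_name])
        subprocess.check_call(["./process_zip", local_name])  # hypothetical C++ tool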

Unzip files using hadoop streaming

人盡茶涼 submitted on 2019-12-31 02:02:05

Question: I have many files in HDFS, each of them a zip file with one CSV file inside it. I'm trying to uncompress the files so I can run a streaming job on them. I tried:

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-D mapred.reduce.tasks=0 \
-mapper /bin/zcat -reducer /bin/cat \
-input /path/to/files/ \
-output /path/to/output

However, I get an error (subprocess failed with code 1). I also tried running on a single file; same error. Any advice?

Answer 1: The root cause of the problem is: you …
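
The quoted answer is truncated; a relevant caveat (stated as background, not as the missing answer): Streaming's default protocol is line-oriented text, so raw archive bytes do not survive the trip to the mapper intact, and /bin/zcat typically sees corrupted input. A pattern that avoids this is to stream a list of HDFS paths and let each mapper fetch and inflate one whole archive itself, similar to the sketch in the previous item; assuming one CSV per zip:

    #!/usr/bin/env python
    # unzip_mapper.py - each input line names one HDFS zip archive; the
    # mapper copies it locally, inflates every member (here, the single
    # CSV), and emits the raw bytes on stdout.
    import subprocess
    import sys
    import zipfile

    for line in sys.stdin:
        hdfs_path = line.strip()
        if not hdfs_path:
            continue
        local_name = hdfs_path.rsplit("/", 1)[-1]
        subprocess.check_call(["hadoop", "fs", "-get", hdfs_path, local_name])
        with zipfile.ZipFile(local_name) as archive:
            for member in archive.namelist():
                sys.stdout.buffer.write(archive.read(member))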

Hive FAILED: ParseException line 2:0 cannot recognize input near ''macaddress'' 'CHAR' '(' in column specification

和自甴很熟 submitted on 2019-12-30 18:53:07

Question: I've tried running hive -v -f sqlfile.sql. Here is the content of the file:

CREATE TABLE UpStreamParam (
    'macaddress' CHAR(50),
    'datats' BIGINT,
    'cmtstimestamp' BIGINT,
    'modulation' INT,
    'chnlidx' INT,
    'severity' BIGINT,
    'rxpower' FLOAT,
    'sigqnoise' FLOAT,
    'noisedeviation' FLOAT,
    'prefecber' FLOAT,
    'postfecber' FLOAT,
    'txpower' FLOAT,
    'txpowerdrop' FLOAT,
    'nmter' FLOAT,
    'premtter' FLOAT,
    'postmtter' FLOAT,
    'unerroreds' BIGINT,
    'corrected' BIGINT,
    'uncorrectables' BIGINT)
STORED AS ORC …

(The single quotes around the column names are the likely culprit: HiveQL treats 'macaddress' as a string literal rather than an identifier, which is exactly what the parser rejects; column names should be left unquoted or quoted with backticks.)

How to read hadoop sequential file?

房东的猫 submitted on 2019-12-30 08:21:15

Question: I have a sequence file which is the output of a Hadoop map-reduce job. In this file the data is written as key-value pairs, and the value itself is a map. I want to read the value as a Map object so that I can process it further.

Configuration config = new Configuration();
Path path = new Path("D:\\OSP\\sample_data\\data\\part-00000");
SequenceFile.Reader reader = new SequenceFile.Reader(FileSystem.get(config), path, config);
WritableComparable key = (WritableComparable) reader.getKeyClass() …

Managing dependencies with Hadoop Streaming?

China☆狼群 submitted on 2019-12-25 08:25:38

Question: I have a quick Hadoop Streaming question. If I'm using Python streaming and I have Python packages that my mappers/reducers require but that aren't installed by default, do I need to install those on all the Hadoop machines as well, or is there some sort of serialization that sends them to the remote machines?

Answer 1: If they're not installed on your task boxes, you can send them with -file. If you need a package or other directory structure, you can send a zipfile, which will be unpacked for you. Here …
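
On the mapper side, Python can import straight from a shipped zip by putting the archive on sys.path (zipimport loads modules from it without unpacking). The names deps.zip, mypackage, and process below are hypothetical:

    #!/usr/bin/env python
    # mapper.py - import a dependency bundled in a zip shipped with the
    # job. Files sent with -file land in the task's working directory,
    # and Python's zipimport can load modules directly from the archive.
    import sys

    sys.path.insert(0, "deps.zip")  # hypothetical archive sent with -file

    import mypackage  # hypothetical module packaged inside deps.zip

    for line in sys.stdin:
        print(mypackage.process(line.strip()))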