hadoop-streaming

Pass directories not files to hadoop-streaming?

Submitted by 假如想象 on 2019-12-03 13:20:50
In my job, I need to parse many historical log sets. Individual customers (there are thousands) may have hundreds of log subdirectories broken out by date. For example:

    logs/Customer_One/2011-01-02-001
    logs/Customer_One/2012-02-03-001
    logs/Customer_One/2012-02-03-002
    logs/Customer_Two/2009-03-03-001
    logs/Customer_Two/2009-03-03-002

Each individual log set may itself be five or six levels deep and contain thousands of files. Therefore, I actually want the individual map jobs to handle walking the subdirectories: simply enumerating individual files is part of my distributed computing …
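
A minimal sketch of one possible approach, assuming the streaming job's input is a small text file that lists one log directory per line (an assumption, not part of the question): each map task then walks its own directories by shelling out to hadoop fs -ls -R, so the file enumeration itself happens inside the mappers rather than at job-submission time.

    #!/usr/bin/env python
    # Hypothetical streaming mapper: each input line is an HDFS directory path.
    # The mapper walks the directory recursively and emits one record per file.
    import subprocess
    import sys

    def list_files(hdfs_dir):
        """Recursively list file paths under an HDFS directory via `hadoop fs -ls -R`."""
        out = subprocess.check_output(["hadoop", "fs", "-ls", "-R", hdfs_dir])
        for line in out.decode("utf-8").splitlines():
            parts = line.split()
            # Plain files (not directories) have '-' in the first permissions column.
            if parts and parts[0].startswith("-"):
                yield parts[-1]

    if __name__ == "__main__":
        for line in sys.stdin:
            directory = line.strip()
            if not directory:
                continue
            for path in list_files(directory):
                # Emit (directory, file path); a reducer could aggregate per customer.
                print("%s\t%s" % (directory, path))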

Hadoop Java Error : Exception in thread “main” java.lang.NoClassDefFoundError: WordCount (wrong name: org/myorg/WordCount)

Submitted by 百般思念 on 2019-12-03 12:39:49
I am new to Hadoop. I followed the Michael Noll tutorial to set up Hadoop on a single node. I tried running the WordCount program. This is the code I used:

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output …

How to import a custom module in a MapReduce job?

Submitted by 时光总嘲笑我的痴心妄想 on 2019-12-03 01:11:38
I have a MapReduce job defined in main.py, which imports the lib module from lib.py. I use Hadoop Streaming to submit this job to the Hadoop cluster as follows:

    hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -files lib.py,main.py -mapper "./main.py map" -reducer "./main.py reduce" -input input -output output

In my understanding, this should put both main.py and lib.py into the distributed cache folder on each compute node and thus make module lib available to main. But that is not what happens: from the log I can see that the files really are copied to the same directory, but main can't …
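
A hedged sketch of one commonly suggested workaround, not necessarily what resolved the poster's case: explicitly put the task's working directory, where -files drops the shipped files, on sys.path before importing. The lib.transform call below is a placeholder for whatever lib.py really provides.

    #!/usr/bin/env python
    # Hypothetical main.py for the job above. The -files option places lib.py in the
    # task's working directory, so make that directory explicitly importable before
    # importing the custom module (Python may resolve the script's own symlink and
    # look elsewhere otherwise).
    import os
    import sys

    sys.path.insert(0, os.getcwd())

    import lib  # the module shipped alongside main.py via -files

    def run_map():
        for line in sys.stdin:
            # lib.transform is a placeholder for whatever lib.py actually exposes.
            sys.stdout.write(lib.transform(line))

    def run_reduce():
        for line in sys.stdin:
            sys.stdout.write(line)  # placeholder identity reducer

    if __name__ == "__main__":
        mode = sys.argv[1] if len(sys.argv) > 1 else "map"
        run_map() if mode == "map" else run_reduce()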

hdfs command is deprecated in hadoop

Submitted by 半城伤御伤魂 on 2019-12-02 08:41:49
I am following the procedure below:

    http://www.codeproject.com/Articles/757934/Apache-Hadoop-for-Windows-Platform
    https://www.youtube.com/watch?v=VhxWig96dME

While executing the command c:/hadoop-2.3.0/bin/hadoop namenode -format, I got the error message given below:

    DEPRECATED: Use of this script to execute hdfs command is deprecated. Instead use the hdfs command for it.
    Exception in thread "main" java.lang.NoClassDefFoundError

I am using jdk-6-windows-amd64.exe. How can I solve this issue? Use the command c:/hadoop-2.3.0/bin/hdfs instead of c:/hadoop-2.3.0/bin/hadoop. A lot of hdfs commands are …

delimiting caret A in python

Submitted by 半腔热情 on 2019-12-02 02:25:22
I have data in the form:

    37101000ssd48800^A1420asd938987^A2011-09-10^A18:47:50.000^A99.00^A1^A0^A
    37101000sd48801^A44557asd03082^A2011-09-06^A13:24:58.000^A42.01^A1^A0^A

So first I took it literally and tried:

    line = line.split("^A")

and also

    line = line.split("\\u001")

So, the issue is: the first approach works on my local machine if I do this:

    cat input.txt | python mapper.py

It runs fine locally (input.txt is the above data), but fails on Hadoop streaming clusters. Someone told me that I should use "\\u001" as the delimiter, but this is also not working, either on my local machine or on …
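
For reference, ^A is the ASCII control character 0x01, so the Python literal for the delimiter is "\x01" (equivalently u"\u0001"); both "^A" and "\\u001" are ordinary multi-character strings and never match the real separator. A minimal mapper sketch built on that:

    #!/usr/bin/env python
    # Minimal sketch: split Control-A delimited records read from stdin.
    # ^A is ASCII 0x01, so the delimiter literal is "\x01"; "\\u001" is just
    # a five-character string and will not split anything.
    import sys

    DELIMITER = "\x01"

    for line in sys.stdin:
        fields = line.rstrip("\n").split(DELIMITER)
        if len(fields) < 2:
            continue  # skip lines that do not look like records
        # Emit the first field as the key and the remaining fields as the value.
        sys.stdout.write(fields[0] + "\t" + ",".join(fields[1:]) + "\n")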

Hive FAILED: ParseException line 2:0 cannot recognize input near ''macaddress'' 'CHAR' '(' in column specification

Submitted by 痴心易碎 on 2019-12-01 17:53:53
I've tried running hive -v -f sqlfile.sql. Here is the content of the file:

    CREATE TABLE UpStreamParam (
        'macaddress' CHAR(50),
        'datats' BIGINT,
        'cmtstimestamp' BIGINT,
        'modulation' INT,
        'chnlidx' INT,
        'severity' BIGINT,
        'rxpower' FLOAT,
        'sigqnoise' FLOAT,
        'noisedeviation' FLOAT,
        'prefecber' FLOAT,
        'postfecber' FLOAT,
        'txpower' FLOAT,
        'txpowerdrop' FLOAT,
        'nmter' FLOAT,
        'premtter' FLOAT,
        'postmtter' FLOAT,
        'unerroreds' BIGINT,
        'corrected' BIGINT,
        'uncorrectables' BIGINT)
    STORED AS ORC TBLPROPERTIES ("orc.compress"="SNAPPY","orc.bloom.filters.columns"="macaddress")
    PARTITIONED BY ('cmtsid' CHAR …

Importing text file: No Columns to parse from file

Submitted by 时光毁灭记忆、已成空白 on 2019-12-01 16:11:25
I am trying to take input from sys.stdin. This is a map-reduce program for Hadoop. The input file is in txt form. Preview of the data set:

    196 242 3 881250949
    186 302 3 891717742
    22 377 1 878887116
    244 51 2 880606923
    166 346 1 886397596
    298 474 4 884182806
    115 265 2 881171488
    253 465 5 891628467
    305 451 3 886324817
    6 86 3 883603013
    62 257 2 879372434
    286 1014 5 879781125
    200 222 5 876042340
    210 40 3 891035994
    224 29 3 888104457
    303 785 3 879485318
    122 387 5 879270459
    194 274 2 879539794
    291 1042 4 874834944

Code that I have been trying:

    import sys
    df = pd.read_csv(sys.stdin, error_bad_lines=False …
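
A minimal sketch of one way around this, assuming the goal is simply to parse the four whitespace-separated columns in a streaming mapper (the field names below are guesses from the preview): read sys.stdin line by line rather than handing the whole stream to pandas, since pandas raises "No columns to parse from file" when the stream it is given turns out to be empty.

    #!/usr/bin/env python
    # Minimal streaming mapper that avoids pandas entirely and parses the
    # whitespace-separated records one line at a time.
    # Field names (user_id, item_id, rating, timestamp) are assumptions
    # based on the preview shown above.
    import sys

    for line in sys.stdin:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        user_id, item_id, rating, timestamp = parts
        # Emit item_id as the key and rating as the value, tab-separated.
        print("%s\t%s" % (item_id, rating))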

Getting the count of records in a data frame quickly

Submitted by 淺唱寂寞╮ on 2019-12-01 15:05:11
I have a dataframe with as many as 10 million records. How can I get a count quickly? df.count is taking a very long time.

It's going to take that much time anyway, at least the first time. One way is to cache the dataframe, so you will be able to do more with it than just count. E.g.:

    df.cache()
    df.count()

Subsequent operations don't take much time.

Ahmed:

    file.groupBy("<column-name>").count().show()

Source: https://stackoverflow.com/questions/39357238/getting-the-count-of-records-in-a-data-frame-quickly
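
A short PySpark sketch of the caching pattern described in the answer above; the session setup and input path are illustrative and not taken from the question.

    # Illustrative PySpark snippet for the caching pattern described above.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("count-example").getOrCreate()

    # df is a placeholder for however the original dataframe was built.
    df = spark.read.parquet("/path/to/data")  # hypothetical input path

    df.cache()            # mark the dataframe for caching
    n = df.count()        # the first action materializes the cache, so it pays the full cost
    print(n)
    n_again = df.count()  # subsequent actions reuse the cached data and return much faster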