hadoop-streaming

Is it possible to compress json in hive external table?

冷暖自知 submitted on 2021-02-10 13:33:16
Question: I want to know how to compress JSON data in a Hive external table. How can it be done? I have created the external table like this: CREATE EXTERNAL TABLE tweets ( id BIGINT, created_at STRING, source STRING, favorited BOOLEAN ) ROW FORMAT SERDE "com.cloudera.hive.serde.JSONSerDe" LOCATION "/user/cloudera/tweets"; and I have set the compression properties: set mapred.output.compress=true; set hive.exec.compress.output=true; set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec; set
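One approach often suggested for this situation (not taken from the excerpt above, and the file names below are hypothetical) is to gzip the JSON files themselves before placing them in the table's LOCATION, since Hive's text-based input format reads .gz files transparently; the set ... properties shown only affect output that Hive itself writes, not pre-existing files in an external location. A minimal Python sketch:

    #!/usr/bin/env python
    # Sketch: gzip a plain-text JSON file so Hive can still read it from the
    # external table location. File names here are hypothetical examples.
    import gzip
    import shutil

    def gzip_file(src_path, dst_path):
        # stream-copy the uncompressed file into a gzip-compressed copy
        with open(src_path, "rb") as src, gzip.open(dst_path, "wb") as dst:
            shutil.copyfileobj(src, dst)

    gzip_file("tweets.json", "tweets.json.gz")
    # then upload tweets.json.gz to /user/cloudera/tweets (e.g. with hdfs dfs -put)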

Hadoop: Error: java.lang.RuntimeException: Error in configuring object

老子叫甜甜 submitted on 2021-02-08 13:02:55
Question: I have Hadoop installed and working, because I ran the word count example and it works great. Now I am trying to move on to some more realistic examples. My example is Example 2 (Average Salaries by each department) from this website. I am using the same code and data from the website. mapper.py #!usr/bin/Python # mapper.py import csv import sys reader = csv.reader(sys.stdin, delimiter=',') writer = csv.writer(sys.stdout, delimiter='\t') for row in reader: agency = row[3]
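One detail worth noting in the excerpt is the shebang line #!usr/bin/Python, which is not a valid interpreter path (missing leading slash, capitalised "Python"); a bad shebang or a non-executable script is a frequent cause of "Error in configuring object" in streaming jobs. A hedged sketch of how the mapper could start, where the salary column index is a guess since the excerpt is truncated:

    #!/usr/bin/env python
    # mapper.py (sketch) -- emits (agency, salary) pairs, tab-separated,
    # for a reducer that averages salaries per agency.
    import csv
    import sys

    reader = csv.reader(sys.stdin, delimiter=',')
    writer = csv.writer(sys.stdout, delimiter='\t')

    for row in reader:
        if len(row) < 5:        # skip malformed or short rows instead of crashing the task
            continue
        agency = row[3]         # column index taken from the excerpt
        salary = row[4]         # hypothetical: the real salary column is not shown in the excerpt
        writer.writerow([agency, salary])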

Error when running python map reduce job using Hadoop streaming in Google Cloud Dataproc environment

强颜欢笑 submitted on 2020-07-05 04:55:34
Question: I want to run a Python MapReduce job in Google Cloud Dataproc using the Hadoop streaming method. My MapReduce Python scripts, the input file and the job output are all located in Google Cloud Storage. I tried to run this command: hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar -file gs://bucket-name/intro_to_mapreduce/mapper_prod_cat.py -mapper gs://bucket-name/intro_to_mapreduce/mapper_prod_cat.py -file gs://bucket-name/intro_to_mapreduce/reducer_prod_cat.py -reducer gs://bucket-name/intro_to
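A commonly suggested workaround (an assumption on my part, not confirmed by the truncated excerpt) is that -file expects paths on the local filesystem of the machine submitting the job, so the gs:// scripts are first copied down and then referenced by their local names. A sketch using the google-cloud-storage client, with bucket and object names taken from the command above:

    #!/usr/bin/env python
    # Sketch: copy the streaming scripts from Cloud Storage to the local filesystem
    # before submitting, so they can be passed to hadoop-streaming via -file.
    from google.cloud import storage

    client = storage.Client()
    bucket = client.bucket("bucket-name")

    for name in ("mapper_prod_cat.py", "reducer_prod_cat.py"):
        blob = bucket.blob("intro_to_mapreduce/" + name)
        blob.download_to_filename(name)
        print("downloaded", name)

    # then submit with local paths, e.g.:
    #   hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    #     -file mapper_prod_cat.py -mapper mapper_prod_cat.py \
    #     -file reducer_prod_cat.py -reducer reducer_prod_cat.py \
    #     -input <gs:// input path> -output <gs:// output path>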

Hadoop streaming “GC overhead limit exceeded”

无人久伴 submitted on 2020-01-24 12:20:08
Question: I am running this command: hadoop jar hadoop-streaming.jar -D stream.tmpdir=/tmp -input "<input dir>" -output "<output dir>" -mapper "grep 20151026" -reducer "wc -l" where <input dir> is a directory with many Avro files, and I am getting this error: Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded at org.apache.hadoop.hdfs.protocol.DatanodeID.updateXferAddrAndInvalidateHashCode(DatanodeID.java:287) at org.apache.hadoop.hdfs.protocol.DatanodeID.<init>(DatanodeID.java:91)

Tool/Ways to schedule Amazon's Elastic MapReduce jobs

…衆ロ難τιáo~ submitted on 2020-01-24 10:26:12
Question: I use EMR to create new instances, process the jobs, and then shut the instances down. My requirement is to schedule jobs periodically. One easy implementation would be to use Quartz to trigger EMR jobs, but in the longer run I am interested in an out-of-the-box MapReduce scheduling solution. My question is: is there any out-of-the-box scheduling feature provided by EMR or the AWS SDK that I can use for this? I can see there is scheduling in Auto Scaling, but I want to
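For context, one pattern that is often used when no built-in EMR scheduler fits is to have an external scheduler (cron, Quartz, a scheduled Lambda) submit a step through the AWS SDK. A rough boto3 sketch, where the cluster ID, bucket and script paths are all placeholders and not from the question:

    #!/usr/bin/env python
    # Sketch: submit a Hadoop streaming step to an existing EMR cluster with boto3.
    # The cluster ID and all s3:// paths below are placeholders.
    import boto3

    emr = boto3.client("emr", region_name="us-east-1")

    response = emr.add_job_flow_steps(
        JobFlowId="j-XXXXXXXXXXXXX",           # placeholder cluster ID
        Steps=[{
            "Name": "nightly-streaming-job",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",    # runs the command below on the cluster
                "Args": [
                    "hadoop-streaming",
                    "-files", "s3://my-bucket/mapper.py,s3://my-bucket/reducer.py",
                    "-mapper", "mapper.py",
                    "-reducer", "reducer.py",
                    "-input", "s3://my-bucket/input/",
                    "-output", "s3://my-bucket/output/",
                ],
            },
        }],
    )
    print(response["StepIds"])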

hadoop streaming: where are application logs?

随声附和 submitted on 2020-01-17 14:06:50
Question: My question is similar to: hadoop streaming: how to see application logs? (The link in that answer no longer works, so I have to post it again with an additional question.) I can see all the Hadoop logs under /usr/local/hadoop/logs, but where can I see application-level logs? For example, in reducer.py: import logging .... logging.basicConfig(level=logging.ERROR, format='MAP %(asctime)s%(levelname)s%(message)s') logging.error('Test!') ... I am not able to see any of the logs (WARNING
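As background (not part of the excerpt): Python's logging writes to stderr by default, and in a streaming job each task's stderr is captured in that task attempt's log directory (and, where log aggregation is enabled, in the aggregated application logs, e.g. via yarn logs -applicationId <id>), not in the daemon logs under /usr/local/hadoop/logs. A small sketch that makes the stderr destination explicit:

    #!/usr/bin/env python
    # reducer.py (sketch) -- log explicitly to stderr so messages end up in the
    # task attempt's stderr log rather than in the stdout data stream.
    import logging
    import sys

    logging.basicConfig(
        stream=sys.stderr,
        level=logging.ERROR,
        format='REDUCE %(asctime)s %(levelname)s %(message)s',
    )

    logging.error('Test!')          # shows up in the task's stderr log
    for line in sys.stdin:
        sys.stdout.write(line)      # stdout stays reserved for key/value output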

How to specify the partitioner for hadoop streaming

可紊 submitted on 2020-01-15 09:55:21
Question: I have a custom partitioner like the one below: import java.util.*; import org.apache.hadoop.mapreduce.*; public static class SignaturePartitioner extends Partitioner<Text,Text> { @Override public int getPartition(Text key, Text value, int numReduceTasks) { return (key.toString().split(" ")[0].hashCode() & Integer.MAX_VALUE) % numReduceTasks; } } I set the hadoop streaming parameters like below: -file SignaturePartitioner.java \ -partitioner SignaturePartitioner \ Then I get an error: Class Not Found. Do

Permission denied error 13 - Python on Hadoop

末鹿安然 submitted on 2020-01-15 07:12:49
Question: I am running a simple Python mapper and reducer and am getting a permission denied error (errno 13). I am not sure what is happening here and need help; I am new to the Hadoop world. I am running a simple MapReduce job for counting words. The mapper and reducer run fine independently on Linux or in Windows PowerShell. hadoop@ubuntu:~/hadoop-1.2.1$ bin/hadoop jar contrib/streaming/hadoop-streaming-1.2.1.jar -file /home/hadoop/mapper.py
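For what it's worth (an assumption, since the excerpt is truncated), errno 13 in streaming jobs frequently traces back to the mapper/reducer scripts not being executable or having an invalid shebang line. A small diagnostic sketch that checks both for the script path shown in the command:

    #!/usr/bin/env python
    # Sketch: check the usual suspects behind "Permission denied (errno 13)" in
    # streaming jobs -- a missing execute bit or a broken shebang line.
    import os
    import stat

    def check_script(path):
        with open(path, "rb") as f:
            shebang = f.readline().decode(errors="replace").rstrip()
        executable = os.access(path, os.X_OK)
        print(path, "| shebang:", shebang, "| executable:", executable)
        if not executable:
            # add the execute bits for user/group/other
            os.chmod(path, os.stat(path).st_mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)

    check_script("/home/hadoop/mapper.py")   # path taken from the command above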