elastic-map-reduce

jar containing org.apache.hadoop.hive.dynamodb

Submitted by 不羁岁月 on 2019-12-13 01:44:18
Question: I was trying to programmatically load a DynamoDB table into HDFS (via Java, not Hive). I couldn't find examples online on how to do it, so I thought I'd download the jar containing org.apache.hadoop.hive.dynamodb and reverse engineer the process. Unfortunately, I couldn't find that file either. Could someone answer the following questions for me (listed in order of priority): a Java example that loads a DynamoDB table into HDFS (that can be passed to a mapper as a table input format); the …
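Since the jar the question mentions was not publicly distributed at the time, the sketch below illustrates the same idea with the later open-sourced emr-dynamodb-connector. The class names (org.apache.hadoop.dynamodb.read.DynamoDBInputFormat, DynamoDBItemWritable), the assumed Text key type, and the dynamodb.* configuration keys are assumptions that should be checked against the connector version in use; the table name and output path are placeholders.

import java.io.IOException;

import org.apache.hadoop.dynamodb.DynamoDBItemWritable;
import org.apache.hadoop.dynamodb.read.DynamoDBInputFormat;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextOutputFormat;

public class DynamoToHdfs {

  // Maps each DynamoDB item to one line of text written into HDFS.
  public static class ItemToTextMapper extends MapReduceBase
      implements Mapper<Text, DynamoDBItemWritable, NullWritable, Text> {
    public void map(Text key, DynamoDBItemWritable item,
                    OutputCollector<NullWritable, Text> out, Reporter reporter)
        throws IOException {
      // DynamoDBItemWritable carries the item's attribute map; emit its text form.
      out.collect(NullWritable.get(), new Text(item.toString()));
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(DynamoToHdfs.class);
    conf.setJobName("dynamodb-to-hdfs");

    // Connector configuration keys; verify the exact names for your connector version.
    conf.set("dynamodb.input.tableName", "MyTable");   // hypothetical table name
    conf.set("dynamodb.regionid", "us-east-1");

    conf.setInputFormat(DynamoDBInputFormat.class);    // table becomes the mapper's input
    conf.setMapperClass(ItemToTextMapper.class);
    conf.setNumReduceTasks(0);                         // map-only copy into HDFS
    conf.setOutputKeyClass(NullWritable.class);
    conf.setOutputValueClass(Text.class);
    conf.setOutputFormat(TextOutputFormat.class);
    FileOutputFormat.setOutputPath(conf, new Path("/user/hadoop/mytable")); // HDFS target

    JobClient.runJob(conf);
  }
}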

Elasticsearch to query across multiple indices and multiple types

Submitted by 旧街凉风 on 2019-12-12 02:08:14
Question: I am a newbie to Elasticsearch. I am using an AWS Elasticsearch instance, version 5.1.1. I have a requirement where I need to specify multiple indices and types in the request body of an Elasticsearch search operation. Is that possible? What is the simplest way to do it? An example would be appreciated. Thanks in advance!

Answer 1: Referring back to the documentation, you can try a simple cURL request such as:

curl -XGET 'localhost:9200/_search?pretty'

This should ideally query across all indices and types. Hope this …
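To narrow the search, a comma-separated list of indices and types can be put directly in the URL path in Elasticsearch 5.x. A minimal sketch (the index, type, and field names are placeholders; on an AWS domain, replace localhost:9200 with the domain endpoint):

curl -XGET 'localhost:9200/index1,index2/type1,type2/_search?pretty' \
  -H 'Content-Type: application/json' -d '
{
  "query": {
    "match": { "some_field": "some_value" }
  }
}'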

Problems using distcp and s3distcp with my EMR job that outputs to HDFS

Submitted by 老子叫甜甜 on 2019-12-11 08:12:13
Question: I've run a job on AWS's EMR and stored the output in the EMR job's HDFS. I am then trying to copy the result to S3 via distcp or s3distcp, but both are failing as described below. (Note: the reason I'm not just sending my EMR job's output directly to S3 is the currently unresolved problem I describe in "Where is my AWS EMR reducer output for my completed job (should be on S3, but nothing there)?".) For distcp, I run (following this post's recommendation):

elastic-mapreduce --jobflow <MY…
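For reference, an s3distcp invocation run directly on the master node typically has the shape below. This is a sketch, not the exact command from the question: the jar location varies by EMR AMI version, and the bucket and paths are placeholders.

# Copy the job's HDFS output to S3 from the EMR master node
hadoop jar /home/hadoop/lib/emr-s3distcp-1.0.jar \
  --src hdfs:///user/hadoop/output/ \
  --dest s3://my-bucket/output/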

AWS Elastic MapReduce doesn't seem to be correctly converting the streaming to jar

Submitted by 扶醉桌前 on 2019-12-11 02:38:13
Question: I have a mapper and reducer that work fine when I run them as a pipeline:

cat data.csv | ./mapper.py | sort -k1,1 | ./reducer.py

I used the Elastic MapReduce wizard and loaded the inputs, outputs, bootstrap action, etc. The bootstrap succeeds, but I still get an error during execution. This is the error in my stderr for step 1:

+ /etc/init.d/hadoop-state-pusher-control stop
+ PID_FILE=/mnt/var/run/hadoop-state-pusher/hadoop-state-pusher.pid
+ LOG_FILE=/mnt/var/log/hadoop-state…
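For comparison (not the accepted fix for this question), a streaming step created with the old elastic-mapreduce CLI usually looks like the sketch below; bucket names and paths are placeholders. A common cause of this class of failure is a missing #!/usr/bin/env python shebang or scripts that are not executable, since Hadoop streaming launches them directly rather than through the shell pipeline used locally.

# mapper.py and reducer.py should each start with a shebang, e.g. #!/usr/bin/env python
elastic-mapreduce --create --stream \
  --input   s3://my-bucket/input/data.csv \
  --output  s3://my-bucket/output/ \
  --mapper  s3://my-bucket/scripts/mapper.py \
  --reducer s3://my-bucket/scripts/reducer.py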

Write 100 million files to s3

Submitted by 江枫思渺然 on 2019-12-11 02:32:36
Question: My main aim is to split records out into files according to the id of each record. There are over 15 billion records right now, and that number will certainly grow, so I need a scalable solution using Amazon EMR. I have already done this for a smaller dataset of around 900 million records. The input files are in CSV format, and one of the fields needs to become the file name in the output. So say there are the following input records:

awesomeId1, somedetail1, somedetail2
awesomeID1, …
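One common way to fan records out into per-id files from a Hadoop/EMR job is MultipleOutputs in the reducer, with the id field used as the map output key. A minimal sketch follows; the class name and record layout are illustrative, not taken from the question.

import java.io.IOException;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

// Receives all records for one id (the map output key) and writes them
// to an output file named after that id.
public class SplitByIdReducer
    extends Reducer<Text, Text, NullWritable, Text> {

  private MultipleOutputs<NullWritable, Text> mos;

  @Override
  protected void setup(Context context) {
    mos = new MultipleOutputs<NullWritable, Text>(context);
  }

  @Override
  protected void reduce(Text id, Iterable<Text> records, Context context)
      throws IOException, InterruptedException {
    for (Text record : records) {
      // Third argument is the base output path/file name, here the record id
      // (e.g. "awesomeId1"); Hadoop appends the usual "-r-00000" suffix.
      mos.write(NullWritable.get(), record, id.toString());
    }
  }

  @Override
  protected void cleanup(Context context)
      throws IOException, InterruptedException {
    mos.close();
  }
}

In the driver, LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class) is typically used alongside MultipleOutputs so that empty default part files are not created.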

In Hadoop, where can I change the default URL ports 50070 and 50030 for the NameNode and JobTracker web pages?

Submitted by …衆ロ難τιáo~ on 2019-12-09 18:36:00
Question: There must be a way to change ports 50070 and 50030 so that the following URLs display the cluster statuses on the ports I pick:

NameNode - http://localhost:50070/
JobTracker - http://localhost:50030/

Answer 1: Define your choice of ports by setting the properties dfs.http.address (for the NameNode) and mapred.job.tracker.http.address (for the JobTracker) in conf/core-site.xml. Note that both values take the host:port form, so replace the port part with the one you want:

<configuration>
  <property>
    <name>dfs.http.address</name>
    <value>0.0.0.0:50070</value>
  </property>
  <property>
    <name>mapred.job.tracker.http.address</name>
    <value>0.0.0.0:50030</value>
  </property>
</configuration>

Getting “No space left on device” for approx. 10 GB of data on EMR m1.large instances

Submitted by 旧时模样 on 2019-12-09 17:51:54
Question: I am getting a "No space left on device" error when I run my Amazon EMR jobs using m1.large as the instance type for the Hadoop instances created by the jobflow. The job generates at most approximately 10 GB of data, and the capacity of an m1.large instance is supposed to be 420 GB × 2 (according to the EC2 instance types page), so I am confused how just 10 GB of data could lead to a "disk space full" kind of message. I am aware of the possibility that this kind of error can also be …
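As a first diagnostic (not a fix), it helps to confirm which filesystem is actually filling up and where Hadoop writes its intermediate data, since the large instance-store volumes are separate from the small root volume. A rough sketch, assuming a Hadoop 1.x-era EMR cluster; the config path varies by AMI:

# On a core/task node: which mount is full? HDFS blocks and map spill space
# should live on the large ephemeral volume(s), not the root volume.
df -h

# HDFS-level view of configured vs. remaining capacity
hadoop dfsadmin -report

# Check that intermediate/temp directories point at the big volumes
grep -A1 'mapred.local.dir\|hadoop.tmp.dir' /home/hadoop/conf/*-site.xml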

Loading data with Hive, S3, EMR, and Recover Partitions

Submitted by 人盡茶涼 on 2019-12-09 05:40:34
Question: SOLVED: See Update #2 below for the 'solution' to this issue.

In S3, I have some log*.gz files stored in a nested directory structure like:

s3://($BUCKET)/y=2012/m=11/d=09/H=10/

I'm attempting to load these into Hive on Elastic MapReduce (EMR), using a multi-level partition spec like:

create external table logs (content string)
partitioned by (y string, m string, d string, h string)
location 's3://($BUCKET)';

Creation of the table works. I then attempt to recover all of the existing …
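For reference, the truncated step is presumably the partition-recovery statement the title refers to. On EMR's Hive build this is the RECOVER PARTITIONS extension, while stock Apache Hive uses MSCK REPAIR TABLE; both are shown below for the table defined above.

-- EMR Hive extension
ALTER TABLE logs RECOVER PARTITIONS;

-- Apache Hive equivalent
MSCK REPAIR TABLE logs;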

Copying/using Python files from S3 to Amazon Elastic MapReduce at bootstrap time

Submitted by 孤街浪徒 on 2019-12-08 13:37:43
Question: I've figured out how to install Python packages (NumPy and such) at the bootstrapping step using boto, as well as how to copy files from S3 to my EC2 instances, still with boto. What I haven't figured out is how to distribute Python scripts (or any file) from S3 buckets to each EMR instance using boto. Any pointers?

Answer 1: If you are using boto, I recommend packaging all your Python files in an archive (.tar.gz format) and then using the cacheArchive directive in Hadoop/EMR to access it. This is …
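A sketch of the packaging-plus-cacheArchive approach the answer describes (bucket and file names are placeholders): build a tarball of the scripts, upload it to S3, and reference it from the streaming step; Hadoop unpacks the archive on every node under the symlink name given after the #.

# Package the Python files and push the archive to S3
tar czf scripts.tar.gz mapper.py reducer.py mylib.py
# (upload scripts.tar.gz to s3://my-bucket/ with your tool of choice)

# In the Hadoop streaming step arguments:
#   -cacheArchive s3://my-bucket/scripts.tar.gz#scripts
# Each node then sees the unpacked files under ./scripts/, so the mapper
# can be invoked as "python scripts/mapper.py".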

How do I make sure RegexSerDe is available to my Hadoop nodes?

Submitted by 旧巷老猫 on 2019-12-08 08:01:47
Question: I'm trying to attack the problem of analyzing web logs with Hive, and I've seen plenty of examples out there, but I can't seem to find anyone with this specific issue. Here's where I'm at: I've set up an AWS Elastic MapReduce cluster, I can log in, and I fire up Hive. I make sure to add jar hive-contrib-0.8.1.jar, and it says it's loaded. I create a table called event_log_raw with a few string columns and a regex. I run

load data inpath '/user/hadoop/tmp' overwrite into table event_log_raw

and I …
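Not necessarily the accepted fix, but the usual way to make RegexSerDe resolvable on the worker nodes is to ADD JAR with an explicit path (which ships the jar via the distributed cache) and to reference the SerDe by its full class name. A sketch follows; the jar path and the column list are assumptions that depend on the AMI/Hive layout and the actual log format.

-- Path is illustrative; point it at wherever hive-contrib lives on your master node
add jar /home/hadoop/hive/lib/hive-contrib-0.8.1.jar;

create table event_log_raw (
  host string,
  request string
)
row format serde 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
with serdeproperties (
  "input.regex" = "([^ ]*) (.*)"
);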