emr

pyspark.sql.utils.AnalysisException: u'Path does not exist

℡╲_俬逩灬. Posted on 2019-12-08 02:54:06
Question: I am running a Spark job on Amazon EMR using standard HDFS, not S3, to store my files. I have a Hive table in hdfs://user/hive/warehouse/, but it cannot be found when my Spark job is run. I configured the Spark property spark.sql.warehouse.dir to reflect my HDFS directory, and the YARN logs do say: 17/03/28 19:54:05 INFO SharedState: Warehouse path is 'hdfs://user/hive/warehouse/'. Later on, the logs say (full log at end of page): LogType:stdout Log Upload Time:Tue Mar
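
One thing worth checking: in an HDFS URI the part right after hdfs:// is parsed as the NameNode authority, so a value like hdfs://user/hive/warehouse/ points at a host named "user" rather than at the /user/hive/warehouse directory. Below is a minimal PySpark sketch of setting the warehouse location with a fully qualified URI; the NameNode address nn-host:8020 and the paths are hypothetical placeholders, not taken from the question.

    # Minimal PySpark sketch; "nn-host:8020" is a hypothetical NameNode address,
    # replace it with the cluster's actual fs.defaultFS authority.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("warehouse-path-example")
        # Fully qualified URI: scheme://authority/path. Without the authority,
        # "hdfs://user/hive/warehouse" treats "user" as the NameNode host.
        .config("spark.sql.warehouse.dir", "hdfs://nn-host:8020/user/hive/warehouse")
        .enableHiveSupport()
        .getOrCreate()
    )

    spark.sql("SHOW TABLES").show()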

Spark not able to fetch events from Amazon Kinesis

喜你入骨 Posted on 2019-12-08 02:47:42
Question: I have been trying to get Spark to read events from Kinesis recently, but I am having problems receiving the events. While Spark is able to connect to Kinesis and fetch metadata from it (e.g. the number of shards), it is not able to get events: it always fetches zero elements back, with no errors, just empty results. I have used these [1 & 2] guides to get it working but have not had much luck yet. I have also tried
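
Empty batches like this often come down to where the receiver starts reading: with InitialPositionInStream.LATEST, only records put after the receiver is up are returned, and a stale DynamoDB checkpoint under the same application name can have a similar effect. A minimal PySpark Streaming sketch is below; the stream name, application name, region, and endpoint are hypothetical, and it assumes the spark-streaming-kinesis-asl package is on the classpath (e.g. via --packages).

    # Minimal sketch; "my-kinesis-app", "my-stream", and us-east-1 are hypothetical.
    from pyspark import SparkContext, StorageLevel
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kinesis import KinesisUtils, InitialPositionInStream

    sc = SparkContext(appName="kinesis-example")
    ssc = StreamingContext(sc, batchDuration=10)

    stream = KinesisUtils.createStream(
        ssc,
        kinesisAppName="my-kinesis-app",   # also the name of the DynamoDB checkpoint table
        streamName="my-stream",
        endpointUrl="https://kinesis.us-east-1.amazonaws.com",
        regionName="us-east-1",
        # TRIM_HORIZON reads records already in the stream; LATEST only sees
        # records put after the receiver starts, which can look like zero events.
        initialPositionInStream=InitialPositionInStream.TRIM_HORIZON,
        checkpointInterval=10,
        storageLevel=StorageLevel.MEMORY_AND_DISK_2,
    )

    stream.count().pprint()
    ssc.start()
    ssc.awaitTermination()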

AWS EMR Spark save to S3 is very slow

亡梦爱人 Posted on 2019-12-07 11:15:27
Question: I have a Spark job running on EMR that takes an unusually long time. The Spark tasks themselves run fast. When I save the result to S3 it spends more than 20 minutes doing this... 16/02/05 01:44:44 INFO latency: StatusCode=[404], Exception=[com.amazonaws.services.s3.model.AmazonS3Exception: Not Found (Service: Amazon S3; Status Code: 404; Error Code: 404 Not Found; Request ID: 561CA7CD8C009E79), S3 Extended Request ID: B3dMnYkxE/QSZsD1VREBf5FR+uH8m5k2Tb8zZ+Y0+VFgQFSwRJjPEWV7wX61+9ZiJhY5nf35Rx8=]
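
A common cause of slow S3 writes from Spark is the output commit step, which renames files; on S3 a rename is a copy plus delete, so it can dominate the job. One frequently used workaround on EMR is to write to HDFS first and then push the finished output to S3 with s3-dist-cp. A minimal sketch, with hypothetical paths and a stand-in DataFrame:

    # Write to HDFS first, then copy to S3 in one pass with s3-dist-cp (available
    # on EMR masters). The HDFS and S3 paths below are hypothetical.
    import subprocess
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("write-then-distcp").getOrCreate()

    df = spark.range(1000)                  # stand-in for the real result
    hdfs_out = "hdfs:///tmp/job-output"
    s3_out = "s3://my-bucket/job-output"    # hypothetical bucket

    df.write.mode("overwrite").parquet(hdfs_out)

    # Copy the finished output to S3; avoids slow S3 "renames" during the commit.
    subprocess.check_call(["s3-dist-cp", "--src", hdfs_out, "--dest", s3_out])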

hadoop streaming: importing modules on EMR

纵然是瞬间 Posted on 2019-12-07 08:55:24
Question: This previous question addressed how to import modules such as nltk for Hadoop Streaming. The steps outlined were:

    zip -r nltkandyaml.zip nltk yaml
    mv nltkandyaml.zip /path/to/where/your/mapper/will/be/nltkandyaml.mod

You can then import the nltk module for use in your Python script:

    import zipimport
    importer = zipimport.zipimporter('nltkandyaml.mod')
    yaml = importer.load_module('yaml')
    nltk = importer.load_module('nltk')

I have a job that I want to run on Amazon's EMR, and I'm not sure where

EMR vs EC2/Hadoop on AWS

二次信任 Posted on 2019-12-07 07:44:18
Question: I know that EC2 is more flexible but requires more work than EMR. However, in terms of cost, using EC2 probably requires EBS volumes attached to the EC2 instances, whereas EMR just streams data in from S3. So, crunching the numbers on the AWS calculator, even though with EMR one must also pay for the EC2 instances, EMR works out cheaper than plain EC2?? Am I wrong here? Of course EC2 with EBS is probably faster, but is it worth the cost? Thanks, Matt
Answer 1: EMR does a lot of things for you that you won't find on

Use bootstrap to replace default jar on EMR

↘锁芯ラ Posted on 2019-12-07 07:20:27
I am on an EMR cluster with AMI 3.0.4. Once the cluster is up, I sshed to the master and did the following manually:

    cd /home/hadoop/share/hadoop/common/lib/
    rm guava-11.0.2.jar
    wget http://central.maven.org/maven2/com/google/guava/guava/14.0.1/guava-14.0.1.jar
    chmod 777 guava-14.0.1.jar

Is it possible to do the above in a bootstrap action? Thanks!
With EMR 4.0 the Hadoop installation path changed, so the manual update to guava-14.0.1.jar must be changed to:

    cd /usr/lib/hadoop/lib
    sudo wget http://central.maven.org/maven2/com/google/guava/guava/14.0.1/guava-14.0.1.jar
    sudo rm guava-11.0.2.jar

The
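
A bootstrap action is just an executable staged in S3 that EMR runs on each node at startup (registered with something like aws emr create-cluster --bootstrap-actions Path=s3://my-bucket/replace-guava.py), so the same commands can be scripted. Below is a hypothetical sketch of such a script written in Python (a shell script works just as well); the S3 path above is made up, and the lib directory assumes EMR 4.x+, not AMI 3.x.

    #!/usr/bin/env python
    # Hypothetical bootstrap script. Paths assume EMR 4.x+; on AMI 3.x use
    # /home/hadoop/share/hadoop/common/lib instead.
    import subprocess

    LIB_DIR = "/usr/lib/hadoop/lib"
    OLD_JAR = "guava-11.0.2.jar"
    NEW_JAR_URL = ("http://central.maven.org/maven2/com/google/guava/"
                   "guava/14.0.1/guava-14.0.1.jar")

    # Fetch the replacement jar into the Hadoop lib directory, then drop the old one.
    subprocess.check_call(["sudo", "wget", NEW_JAR_URL, "-P", LIB_DIR])
    subprocess.check_call(["sudo", "rm", "-f", LIB_DIR + "/" + OLD_JAR])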

Compress file on S3

感情迁移 Posted on 2019-12-06 18:21:57
Question: I have a 17.7GB file on S3. It was generated as the output of a Hive query, and it isn't compressed. I know that by compressing it, it would be about 2.2GB (gzip). How can I download this file locally as quickly as possible when the transfer is the bottleneck (250kB/s)? I've not found any straightforward way to compress the file on S3, or to enable compression on transfer in s3cmd, boto, or related tools.
Answer 1: S3 does not support stream compression, nor is it possible to compress the uploaded file
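
Since S3 won't compress the object in place, one workaround is to do the compression inside AWS (for example on an EC2 or EMR instance in the same region) so that only the roughly 2.2GB gzipped copy has to cross the slow 250kB/s link. A minimal boto3 sketch, with a hypothetical bucket and key:

    # Run on an instance in the same region so the 17.7GB read stays inside AWS.
    import gzip
    import shutil
    import boto3

    s3 = boto3.client("s3")
    bucket, key = "my-bucket", "hive-output/part-00000"   # hypothetical

    # Stream the uncompressed object through gzip to a local file...
    body = s3.get_object(Bucket=bucket, Key=key)["Body"]
    with gzip.open("/tmp/output.gz", "wb") as gz:
        shutil.copyfileobj(body, gz)

    # ...then upload the small compressed copy and download that instead.
    s3.upload_file("/tmp/output.gz", bucket, key + ".gz")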

GroupBy Operation of DataFrame takes lot of time in spark 2.0

会有一股神秘感。 Posted on 2019-12-06 08:17:06
Question: In one of my Spark jobs (2.0 on EMR 5.0.0), I had about 5GB of data that was cross joined with 30 rows (a few MB of data), and I then needed to group by it. I noticed that it was taking a lot of time (approximately 4 hours with one m3.xlarge master and six m3.2xlarge core nodes). Of the total time, 2 hours were taken by processing and another 2 hours by writing the data to S3. The time taken was not very impressive to me. I tried searching the net and found this link that says groupBy
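
With only 30 rows on one side, broadcasting that side avoids shuffling the 5GB side for the join, and coalescing before the write cuts down the number of files pushed to S3. A minimal sketch with stand-in DataFrames and a hypothetical output path (in Spark 2.0 a join without a condition may also require spark.sql.crossJoin.enabled=true):

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (
        SparkSession.builder
        .appName("broadcast-cross-join")
        .config("spark.sql.crossJoin.enabled", "true")
        .getOrCreate()
    )

    big_df = spark.range(1000000).withColumnRenamed("id", "key")   # stand-in for the 5GB side
    small_df = spark.range(30).withColumnRenamed("id", "factor")   # stand-in for the 30 rows

    # Broadcasting the 30-row side keeps the 5GB side from being shuffled for the join.
    joined = big_df.join(F.broadcast(small_df))

    result = joined.groupBy("key").agg(F.count("*").alias("cnt"))

    # Fewer output partitions means fewer small files pushed to S3 (hypothetical path).
    result.coalesce(100).write.mode("overwrite").parquet("s3://my-bucket/grouped")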

YARN log aggregation on AWS EMR - UnsupportedFileSystemException

守給你的承諾、 Posted on 2019-12-06 03:59:50
Question: I am struggling to enable YARN log aggregation for my Amazon EMR cluster. I am following this documentation for the configuration: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-debugging.html#emr-plan-debugging-logs-archive, under the section titled "To aggregate logs in Amazon S3 using the AWS CLI". I've verified that the hadoop-config bootstrap action puts the following in yarn-site.xml: <property><name>yarn.log-aggregation-enable</name><value>true</value><

How does MapReduce read from multiple input files?

旧时模样 Posted on 2019-12-05 06:47:41
Question: I am developing code to read data and write it into HDFS using MapReduce. However, when I have multiple files I don't understand how they are processed. The input path to the mapper is the name of the directory, as evident from the output of String filename = conf1.get("map.input.file"); So how does it process the files in the directory?
Answer 1: In order to get the input file path you can use the context object, like this: FileSplit fileSplit = (FileSplit) context.getInputSplit(); String