amazon-emr

collect() or toPandas() on a large DataFrame in pyspark/EMR

Submitted by 喜欢而已 on 2019-11-27 16:16:19
I have an EMR cluster of one machine, a "c3.8xlarge". After reading several resources, I understood that I have to allow a decent amount of off-heap memory because I am using pyspark, so I have configured the cluster as follows:

One executor:

    spark.executor.memory 6g
    spark.executor.cores 10
    spark.yarn.executor.memoryOverhead 4096

Driver:

    spark.driver.memory 21g

When I cache() the DataFrame it takes about 3.6 GB of memory. Now when I call collect() or toPandas() on the DataFrame, the process crashes. I know that I am bringing a large amount of data into the driver, but I think that it is not that …
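A minimal PySpark sketch of the usual mitigations, assuming Spark 2.3+; the S3 path is a placeholder, not from the question. spark.driver.maxResultSize caps how much collect()/toPandas() may pull back to the driver (1g by default), and Arrow serialization makes toPandas() far cheaper:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("collect-demo")
        # Raise the 1g default cap on results collected to the driver.
        .config("spark.driver.maxResultSize", "8g")
        # Spark 2.3+: Arrow cuts the serialization cost of toPandas().
        .config("spark.sql.execution.arrow.enabled", "true")
        .getOrCreate()
    )

    df = spark.read.parquet("s3://my-bucket/my-data/")  # placeholder path
    df.cache()
    pdf = df.toPandas()  # still materializes the whole dataset on the driver

Even with these settings, everything collected must fit in the driver JVM plus a pandas copy, so sampling or aggregating before collecting is often the real fix.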

Strange spark ERROR on AWS EMR

Submitted by 情到浓时终转凉″ on 2019-11-27 11:14:11
Question: I have a really simple PySpark script that creates a DataFrame from some parquet data on S3, then calls the count() method and prints out the number of records. I run the script on an AWS EMR cluster and I'm seeing the following strange WARN message:

    17/12/04 14:20:26 WARN ServletHandler:
    javax.servlet.ServletException: java.util.NoSuchElementException: None.get
        at org.glassfish.jersey.servlet.WebComponent.serviceImpl(WebComponent.java:489)
        at org.glassfish.jersey.servlet.WebComponent.service …
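For reference, a minimal reproduction of the flow the question describes, with a placeholder S3 path; the WARN reportedly appears even for something this trivial:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("count-parquet").getOrCreate()
    df = spark.read.parquet("s3://my-bucket/some-parquet/")  # placeholder
    print(df.count())
    spark.stop()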

Dealing with a large gzipped file in Spark

Submitted by 自闭症网瘾萝莉.ら on 2019-11-27 04:42:51
Question: I have a large (about 85 GB compressed) gzipped file from S3 that I am trying to process with Spark on AWS EMR (right now with an m4.xlarge master instance and two m4.10xlarge core instances, each with a 100 GB EBS volume). I am aware that gzip is a non-splittable file format, and I've seen it suggested that one should repartition the compressed file because Spark initially gives an RDD with one partition. However, after doing

    scala> val raw = spark.read.format("com.databricks.spark.csv").
         | …
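A PySpark sketch of the repartition approach the question refers to (the question itself uses the Scala shell; the path and partition count are placeholders). The key point is that repartition() only helps after the single-task read:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("gzip-demo").getOrCreate()

    # gzip is non-splittable: the initial scan is one task on one core,
    # which must decompress the entire 85 GB by itself.
    raw = (spark.read
           .option("header", "true")
           .csv("s3://my-bucket/big-file.csv.gz"))  # placeholder path

    # repartition() inserts a shuffle AFTER that read; it parallelizes
    # downstream stages but cannot speed up the decompression itself.
    repartitioned = raw.repartition(256)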

Amazon EC2 vs. Amazon EMR [closed]

Submitted by 坚强是说给别人听的谎言 on 2019-11-27 01:20:43
Question: I have implemented a task in Hive. Currently it is working fine on my single-node cluster. Now I am planning to deploy it on AWS, but I don't know anything about AWS. If I deploy it, what should I choose: Amazon EC2 or Amazon EMR? I want to improve the performance of my task. Which one is better and more reliable for me, and how should I approach them? I have heard that we can also bring our VM setup as-is to AWS. Is that possible? Please advise as soon as possible. Many thanks.

Answer 1: …

Application report for application_ (state: ACCEPTED) never ends for Spark Submit (with Spark 1.2.0 on YARN)

Submitted by 别等时光非礼了梦想. on 2019-11-26 19:33:35
Question: I am running the Kinesis plus Spark application from https://spark.apache.org/docs/1.2.0/streaming-kinesis-integration.html. I am running the command below on an EC2 instance:

    ./spark/bin/spark-submit --class org.apache.spark.examples.streaming.myclassname --master yarn-cluster --num-executors 2 --driver-memory 1g --executor-memory 1g --executor-cores 1 /home/hadoop/test.jar

I have installed Spark on EMR. EMR details:

    Master instance group - 1 Running MASTER   m1.medium   1
    Core instance group   - 2 Running CORE …
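An application stuck in the ACCEPTED state usually means YARN cannot allocate the requested containers (an m1.medium master leaves very little room). One way to check, sketched against the YARN ResourceManager REST API; the hostname is a placeholder and the default port 8088 is assumed:

    import requests  # assumes network access to the EMR master node

    resp = requests.get("http://<emr-master-dns>:8088/ws/v1/cluster/metrics")
    m = resp.json()["clusterMetrics"]

    # If these cannot cover the driver plus every requested executor
    # (including memory overhead), the application waits in ACCEPTED.
    print("available MB:   ", m["availableMB"])
    print("available cores:", m["availableVirtualCores"])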

Saving dataframe to local file system results in empty results

Submitted by 亡梦爱人 on 2019-11-26 14:48:39
Question: We are running Spark 2.3.0 on AWS EMR. The following DataFrame "df" is non-empty and of modest size:

    scala> df.count
    res0: Long = 4067

The following code works fine for writing df to hdfs:

    scala> val hdf = spark.read.parquet("/tmp/topVendors")
    hdf: org.apache.spark.sql.DataFrame = [displayName: string, cnt: bigint]

    scala> hdf.count
    res4: Long = 4067

However, using the same code to write to a local parquet or csv file ends up with empty results:

    df.repartition(1).write.mode("overwrite") …
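A likely explanation, sketched in PySpark (the question itself uses the Scala shell): on a multi-node cluster, a "local" path is local to whichever executor runs the write task, not to the machine you are logged into, so the driver-side directory can end up empty:

    # file:// output lands on the worker that ran the task, not the driver.
    df.repartition(1).write.mode("overwrite").parquet("file:///tmp/topVendors")

    # Patterns that do land on one known machine: write to HDFS/S3 and copy
    # down afterwards (hdfs dfs -get ...), or, for small data, convert on
    # the driver (requires pandas and pyarrow on the driver):
    df.toPandas().to_parquet("/tmp/topVendors.parquet")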

How to submit Spark jobs to EMR cluster from Airflow?

Submitted by 女生的网名这么多〃 on 2019-11-26 04:51:17
Question: How can I establish a connection between an EMR master cluster (created by Terraform) and Airflow? I have Airflow set up on an AWS EC2 server with the same SG, VPC and subnet. I need a solution so that Airflow can talk to EMR and execute spark-submit.

https://aws.amazon.com/blogs/big-data/build-a-concurrent-data-orchestration-pipeline-using-amazon-emr-and-apache-livy/

This blog covers what happens after the connection has been established (it didn't help much). In Airflow I have made a connection …
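One common pattern, sketched with the contrib EMR operators shipped with Airflow 1.x; the cluster ID (e.g. a Terraform output), the script path, and the task names are placeholders:

    from airflow.contrib.operators.emr_add_steps_operator import EmrAddStepsOperator
    from airflow.contrib.sensors.emr_step_sensor import EmrStepSensor

    CLUSTER_ID = "j-XXXXXXXXXXXXX"  # placeholder: pass in from Terraform output

    SPARK_STEP = [{
        "Name": "my_spark_job",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",  # EMR wrapper that runs spark-submit
            "Args": ["spark-submit", "--deploy-mode", "cluster",
                     "s3://my-bucket/jobs/my_job.py"],  # placeholder script
        },
    }]

    add_step = EmrAddStepsOperator(
        task_id="add_step",
        job_flow_id=CLUSTER_ID,
        aws_conn_id="aws_default",  # AWS credentials configured in Airflow
        steps=SPARK_STEP,
        dag=dag,  # assumes a DAG object defined elsewhere
    )

    # Poll the submitted step until it completes or fails.
    watch_step = EmrStepSensor(
        task_id="watch_step",
        job_flow_id=CLUSTER_ID,
        step_id="{{ task_instance.xcom_pull(task_ids='add_step', key='return_value')[0] }}",
        aws_conn_id="aws_default",
        dag=dag,
    )

    add_step >> watch_step

Because the step runs on the EMR master via command-runner.jar, Airflow only needs AWS API access to the cluster, not SSH into it.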