amazon-emr | 易学教程

collect() or toPandas() on a large DataFrame in pyspark/EMR

阅读更多关于 collect() or toPandas() on a large DataFrame in pyspark/EMR

I have an EMR cluster of one machine "c3.8xlarge", after reading several resources, I understood that I have to allow decent amount of memory off-heap because I am using pyspark, so I have configured the cluster as follow: One executor: spark.executor.memory 6g spark.executor.cores 10 spark.yarn.executor.memoryOverhead 4096 Driver: spark.driver.memory 21g When I cache() the DataFrame it takes about 3.6GB of memory. Now when I call collect() or toPandas() on the DataFrame, the process crashes. I know that I am bringing a large amount of data into the driver, but I think that it is not that

Strange spark ERROR on AWS EMR

阅读更多关于 Strange spark ERROR on AWS EMR

问题 I have a really simple PySpark script that creates a dataframe from some parquet data on S3 and then call count() method and print out the number of records. I run the script on AWS EMR cluster and I'm seeing following strange WARN information: 17/12/04 14:20:26 WARN ServletHandler: javax.servlet.ServletException: java.util.NoSuchElementException: None.get at org.glassfish.jersey.servlet.WebComponent.serviceImpl(WebComponent.java:489) at org.glassfish.jersey.servlet.WebComponent.service

Dealing with a large gzipped file in Spark

阅读更多关于 Dealing with a large gzipped file in Spark

问题 I have a large (about 85 GB compressed) gzipped file from s3 that I am trying to process with Spark on AWS EMR (right now with an m4.xlarge master instance and two m4.10xlarge core instances each with a 100 GB EBS volume). I am aware that gzip is a non-splittable file format, and I've seen it suggested that one should repartition the compressed file because Spark initially gives an RDD with one partition. However, after doing scala> val raw = spark.read.format("com.databricks.spark.csv"). |

Amazon EC2 vs. Amazon EMR [closed]

阅读更多关于 Amazon EC2 vs. Amazon EMR [closed]

问题 I have implemented a task in Hive. Currently it is working fine on my single node cluster. Now I am planning to deploy it on AWS. I don't know anything about the AWS. If I plan to deploy it then what should I choose Amazon EC2 or Amazon EMR? I want to improve the performance of my task. Which one is better and reliable for me? How to approach towards them? I heard that we can also register our VM setting as it is on AWS. Is it possible? Please suggest me as soon as possible. Many Thanks. 回答1:

Application report for application_ (state: ACCEPTED) never ends for Spark Submit (with Spark 1.2.0 on YARN)

阅读更多关于 Application report for application_ (state: ACCEPTED) never ends for Spark Submit (with Spark 1.2.0 on YARN)

问题 I am running kinesis plus spark application https://spark.apache.org/docs/1.2.0/streaming-kinesis-integration.html I am running as below command on ec2 instance : ./spark/bin/spark-submit --class org.apache.spark.examples.streaming.myclassname --master yarn-cluster --num-executors 2 --driver-memory 1g --executor-memory 1g --executor-cores 1 /home/hadoop/test.jar I have installed spark on EMR. EMR details Master instance group - 1 Running MASTER m1.medium 1 Core instance group - 2 Running CORE

Saving dataframe to local file system results in empty results

阅读更多关于 Saving dataframe to local file system results in empty results

问题 We are running spark 2.3.0 on AWW EMR . The following DataFrame " df " is non empty and of modest size: scala> df.count res0: Long = 4067 The following code works fine for writing df to hdfs : scala> val hdf = spark.read.parquet("/tmp/topVendors") hdf: org.apache.spark.sql.DataFrame = [displayName: string, cnt: bigint] scala> hdf.count res4: Long = 4067 However using the same code to write to a local parquet or csv file end up with empty results: df.repartition(1).write.mode("overwrite")

How to submit Spark jobs to EMR cluster from Airflow?

阅读更多关于 How to submit Spark jobs to EMR cluster from Airflow?

问题 How can I establish a connection between EMR master cluster(created by Terraform) and Airflow. I have Airflow setup under AWS EC2 server with same SG,VPC and Subnet. I need solutions so that Airflow can talk to EMR and execute Spark submit. https://aws.amazon.com/blogs/big-data/build-a-concurrent-data-orchestration-pipeline-using-amazon-emr-and-apache-livy/ These blogs have understanding on execution after connection has been established.(Didn\'t help much) In airflow I have made a connection