spark-submit

Apache Spark — using spark-submit throws a NoSuchMethodError

Submitted by 本小妞迷上赌 on 2020-02-20 04:44:33
Question: To submit a Spark application to a cluster, the documentation notes: "To do this, create an assembly jar (or “uber” jar) containing your code and its dependencies. Both sbt and Maven have assembly plugins. When creating assembly jars, list Spark and Hadoop as provided dependencies; these need not be bundled since they are provided by the cluster manager at runtime." -- http://spark.apache.org/docs/latest/submitting-applications.html So I added the Apache Maven Shade Plugin to my pom.xml file…
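
A minimal pom.xml sketch of that setup, assuming a Scala 2.11 / Spark 2.x build; the versions and artifact ids are placeholders, and the two essential points are the provided scope on the Spark dependency and the Shade plugin bound to the package phase:

```xml
<!-- Sketch only: versions and artifact ids are placeholders. -->
<dependencies>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.4.0</version>
    <!-- Provided by the cluster at runtime, so it is left out of the uber jar. -->
    <scope>provided</scope>
  </dependency>
  <!-- Dependencies that are NOT on the cluster keep the default (compile) scope. -->
</dependencies>

<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <version>3.2.1</version>
      <executions>
        <execution>
          <phase>package</phase>
          <goals>
            <goal>shade</goal>
          </goals>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>
```

A NoSuchMethodError at runtime usually means the Spark version compiled against differs from the one the cluster provides, so the version in the pom should match the cluster's.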

How to drop messages in console when using spark-submit? [duplicate]

Submitted by 二次信任 on 2020-01-17 13:47:05
Question: This question already has answers here: How to stop INFO messages displaying on spark console? (19 answers). Closed 3 years ago. When I run a spark-submit job with Scala, I see a lot of status messages in the console, but I would like to see only my own prints. Is there a parameter I can set so these messages are not shown? Answer 1: This should do the trick for the most part. Put it inside the code: import org.apache.log4j.{Level, Logger} Logger.getLogger("org").setLevel(Level.WARN) Logger.getLogger(…
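
A minimal Scala sketch of the approach the answer starts to describe; the exact logger names and levels can be adjusted to taste:

```scala
import org.apache.log4j.{Level, Logger}

// Silence Spark's and Akka's internal INFO chatter while keeping your own output.
Logger.getLogger("org").setLevel(Level.WARN)   // use Level.ERROR to hide WARNs as well
Logger.getLogger("akka").setLevel(Level.WARN)

// Alternatively, once a SparkContext exists:
// sc.setLogLevel("ERROR")
```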

Kafka Stream to Spark Stream python

Submitted by 青春壹個敷衍的年華 on 2020-01-15 12:15:08
Question: We have a Kafka stream that uses Avro, and I need to connect it to Spark Streaming. I use the code below, as Lev G suggested: kvs = KafkaUtils.createDirectStream(ssc, [topic], {"metadata.broker.list": brokers}, valueDecoder=MessageSerializer.decode_message) I get the error below when I execute it through spark-submit: 2018-10-09 10:49:27 WARN YarnSchedulerBackend$YarnSchedulerEndpoint:66 - Requesting driver to remove executor 12 for reason Container marked as failed: container_1537396420651_0008_01_000013 on…
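
The container failure itself is truncated above, but one common prerequisite for KafkaUtils.createDirectStream in PySpark is shipping the Kafka integration package at submit time. A hedged sketch of such an invocation; the package coordinates and version are assumptions and must match the cluster's Spark and Scala versions, and the script name is a placeholder:

```bash
# Sketch: the package version must match the cluster's Spark/Scala build.
spark-submit \
  --master yarn \
  --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.3.0 \
  my_streaming_job.py
```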

spark-submit with specific python libraries

Submitted by 一曲冷凌霜 on 2020-01-15 07:21:55
Question: I have PySpark code that depends on third-party libraries. I want to execute this code on my cluster, which runs under Mesos. I have a zipped version of my Python environment on an HTTP server reachable by my cluster, but I am having trouble telling my spark-submit invocation to use this environment. I use both --archives to load the zip file and --conf 'spark.pyspark.driver.python=path/to/my/env/bin/python' plus --conf 'spark.pyspark.python=path/to/my/env/bin/python' to specify the…
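
A hedged sketch of such an invocation. The key detail is that once the archive is extracted on the executors, the interpreter paths must point at the extracted location (aliased here as pyenv), not at a path that only exists on the machine that built the zip; the URL, paths, and alias are placeholders, and depending on deploy mode the driver path may instead need to point at a locally extracted copy:

```bash
# Sketch: URLs, paths and the 'pyenv' alias are placeholders.
spark-submit \
  --master mesos://master-host:5050 \
  --archives http://my-http-server/envs/pyenv.zip#pyenv \
  --conf spark.pyspark.driver.python=./pyenv/bin/python \
  --conf spark.pyspark.python=./pyenv/bin/python \
  my_job.py
```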

spark-submit config through file

Submitted by 谁都会走 on 2019-12-30 10:49:10
Question: I am trying to deploy a Spark job using spark-submit, which takes a bunch of parameters, like: spark-submit --class Eventhub --master yarn --deploy-mode cluster --executor-memory 1024m --executor-cores 4 --files app.conf spark-hdfs-assembly-1.0.jar --conf "app.conf" I was looking for a way to put all these flags in a file to pass to spark-submit, so that my spark-submit command is as simple as: spark-submit --class Eventhub --master yarn --deploy-mode cluster --config-file my-app.cfg --files app.conf…
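
spark-submit already supports this through its --properties-file flag; only spark.* keys are read from that file, so the command-line flags map onto their spark.* equivalents. A sketch, with the properties file name assumed:

```bash
# my-app.conf (assumed name) holds the spark.* equivalents of the flags:
#   spark.master             yarn
#   spark.submit.deployMode  cluster
#   spark.executor.memory    1024m
#   spark.executor.cores     4

spark-submit \
  --class Eventhub \
  --properties-file my-app.conf \
  --files app.conf \
  spark-hdfs-assembly-1.0.jar
```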

join two dataframes without a common column (Spark, Scala)

Submitted by 人走茶凉 on 2019-12-29 08:13:11
Question: I have two DataFrames with different sets of columns and I need to join them. See the example below: df1 has Customer_name, Customer_phone, Customer_age; df2 has Order_name, Order_ID. These two DataFrames have no common column, and the number of rows and columns also differs. I tried to insert a new dummy column to carry a row_index value, as below: val dfr = df1.withColumn("row_index", monotonically_increasing_id()). But…
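
A Scala sketch of one way to finish that idea, using the question's df1 and df2: because monotonically_increasing_id() is not consecutive across partitions, derive a consecutive row_index with row_number() and join on it (the unpartitioned ordering window pulls everything onto one partition, so this only suits modest data sizes):

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{monotonically_increasing_id, row_number}

// Give both DataFrames a consecutive 1-based row_index, then join on it.
val w = Window.orderBy(monotonically_increasing_id())
val df1Indexed = df1.withColumn("row_index", row_number().over(w))
val df2Indexed = df2.withColumn("row_index", row_number().over(w))

// Rows are paired purely by position; rely on this only if that pairing is meaningful.
val joined = df1Indexed.join(df2Indexed, Seq("row_index")).drop("row_index")
```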

Spark standalone connection driver to worker

Submitted by 十年热恋 on 2019-12-29 07:21:18
Question: I'm trying to host a Spark standalone cluster locally. I have two heterogeneous machines connected on a LAN, and each piece of the architecture listed below runs in Docker. I have the following configuration: master on machine 1 (port 7077 exposed), worker on machine 1, driver on machine 2. I use a test application that opens a file and counts its lines. The application works when the file is replicated on all workers and I use SparkContext.textFile(), but when the file is only present on…
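
For context, a minimal Scala sketch of such a line-count test app; the master URL is a placeholder, and the input path must be readable from every worker (HDFS, a shared mount, or an identical local path on each machine), since a file that exists only on the driver's machine fails when tasks try to read it:

```scala
import org.apache.spark.sql.SparkSession

object LineCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("line-count")
      .master("spark://machine1:7077") // placeholder standalone master URL
      .getOrCreate()

    // The path must be resolvable on every worker, not just on the driver.
    val count = spark.sparkContext.textFile(args(0)).count()
    println(s"lines: $count")

    spark.stop()
  }
}
```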

How to execute spark-submit on Amazon EMR from a Lambda function?

Submitted by 强颜欢笑 on 2019-12-29 03:05:35
Question: I want to execute a spark-submit job on an AWS EMR cluster based on a file-upload event on S3. I am using an AWS Lambda function to capture the event, but I have no idea how to submit a spark-submit job to the EMR cluster from the Lambda function. Most of the answers I found talk about adding a step to the EMR cluster, but I do not know whether I can add a step that fires "spark-submit --with args". Answer 1: You can, I had to do the same thing last week! Using boto3 for Python (other languages…
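
A hedged Python sketch of the boto3 approach the answer alludes to: the Lambda handler adds a spark-submit step to an already-running cluster via command-runner.jar. The cluster id, class name, and jar location are placeholders:

```python
import boto3

emr = boto3.client("emr")

def lambda_handler(event, context):
    # Pull the uploaded object's location out of the S3 event notification.
    record = event["Records"][0]["s3"]
    uploaded = f"s3://{record['bucket']['name']}/{record['object']['key']}"

    # Add a spark-submit step to an existing cluster (placeholder cluster id).
    emr.add_job_flow_steps(
        JobFlowId="j-XXXXXXXXXXXXX",
        Steps=[{
            "Name": "spark-submit triggered by S3 upload",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "spark-submit", "--deploy-mode", "cluster",
                    "--class", "com.example.Main",   # placeholder main class
                    "s3://my-bucket/app.jar",        # placeholder application jar
                    uploaded,                        # pass the new file as an argument
                ],
            },
        }],
    )
```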

How to reference .so files in spark-submit command

Submitted by 我只是一个虾纸丫 on 2019-12-24 11:13:20
Question: I am using the TimesTen database with Spark 2.3.0. I need to reference .so files in the spark-submit command in order to connect to the TimesTen DB. Is there an option for this in spark-submit? I tried adding the .so file in --conf spark.executor.extraLibraryPath, but it doesn't resolve the error. The error I am getting is: Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 135 in stage 8.0 failed 4 times, most recent failure: Lost task 135.3 in stage 8.0 (TID…
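
A hedged sketch of one way to try this: ship the native library with --files so it lands in each executor's working directory, and point the library-path settings (and LD_LIBRARY_PATH) there. The class, jar, and TimesTen library path and file names are placeholders for the actual installation, and the directory named in the driver setting must exist on the driver's machine:

```bash
# Sketch: class, jar and library names/paths are placeholders.
spark-submit \
  --class com.example.TimesTenJob \
  --master yarn \
  --files /opt/TimesTen/lib/libttJdbc.so \
  --conf spark.driver.extraLibraryPath=/opt/TimesTen/lib \
  --conf spark.executor.extraLibraryPath=. \
  --conf spark.executorEnv.LD_LIBRARY_PATH=. \
  app.jar
```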