spark-submit

Add a Python external library in PySpark

Submitted by 强颜欢笑 on 2019-12-24 03:39:27
Question: I'm using PySpark (1.6) and I want to use the databricks:spark-csv library. I've tried different ways with no success. 1- I tried to add a jar I downloaded from https://spark-packages.org/package/databricks/spark-csv and ran pyspark --jars THE_NAME_OF_THE_JAR, then df = sqlContext.read.format('com.databricks:spark-csv').options(header='true', inferschema='true').load('/dlk/doaat/nsi_dev/utilisateur/referentiel/refecart.csv') But got this error: Traceback (most recent call last): File "
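The traceback above is cut off, so as a hedged sketch of one common fix (not necessarily the thread's answer): let Spark resolve the package itself with --packages, and use the dotted format name 'com.databricks.spark.csv' rather than the Maven coordinate 'com.databricks:spark-csv'. The package version and Scala suffix below are assumptions chosen to match a Spark 1.6 build.

    # Launch the shell so Spark downloads the dependency itself (version assumed):
    #   pyspark --packages com.databricks:spark-csv_2.10:1.5.0
    # Then read the CSV with the dotted format name, not the Maven coordinate.
    df = (sqlContext.read
          .format('com.databricks.spark.csv')
          .options(header='true', inferschema='true')
          .load('/dlk/doaat/nsi_dev/utilisateur/referentiel/refecart.csv'))
    df.printSchema()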

List of spark-submit options

Submitted by 时间秒杀一切 on 2019-12-23 01:43:33
Question: There are a ton of tunable settings mentioned on the Spark configurations page. However, as noted here, the SparkSubmitOptionParser attribute name for a Spark property can differ from that property's name. For instance, spark.executor.cores is passed as --executor-cores in spark-submit. Where can I find an exhaustive list of all tuning parameters of Spark (along with their SparkSubmitOptionParser property names) that can be passed with the spark-submit command? Answer 1: While @suj1th 's valuable
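The answer is truncated above, so here is a small hedged illustration of the general rule rather than an exhaustive list: only a handful of properties have dedicated spark-submit flags (spark-submit --help prints them), and every other property is passed through --conf key=value.

    # A few well-known property-name -> spark-submit flag pairs; anything without
    # a dedicated flag is passed as: spark-submit --conf spark.some.property=value
    flag_for_property = {
        "spark.master":          "--master",
        "spark.driver.memory":   "--driver-memory",
        "spark.executor.memory": "--executor-memory",
        "spark.executor.cores":  "--executor-cores",
    }
    for prop, flag in flag_for_property.items():
        print(f"{prop:24} -> {flag}")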

AWS EMR using Spark steps in cluster mode. Application application_ finished with failed status

Submitted by 跟風遠走 on 2019-12-20 05:12:20
Question: I'm trying to launch a cluster using the AWS CLI. I use the following command: aws emr create-cluster --name "Config1" --release-label emr-5.0.0 --applications Name=Spark --use-default-role --log-uri 's3://aws-logs-813591802533-us-west-2/elasticmapreduce/' --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m1.medium InstanceGroupType=CORE,InstanceCount=2,InstanceType=m1.medium The cluster is created successfully. Then I add this command: aws emr add-steps --cluster-id ID
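The add-steps command above is cut off. As a hedged sketch (cluster id, class name, and S3 paths are placeholders), the same Spark step can be expressed with boto3, running spark-submit in cluster deploy mode through command-runner.jar:

    import boto3

    emr = boto3.client("emr", region_name="us-west-2")
    emr.add_job_flow_steps(
        JobFlowId="j-XXXXXXXXXXXXX",          # placeholder cluster id
        Steps=[{
            "Name": "Spark step",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "spark-submit", "--deploy-mode", "cluster",
                    "--class", "com.example.Main",      # placeholder main class
                    "s3://my-bucket/jars/my-app.jar",   # placeholder application jar
                ],
            },
        }],
    )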

How to append a resource jar for spark-submit?

Submitted by ╄→尐↘猪︶ㄣ on 2019-12-20 04:59:34
Question: My Spark application depends on adam_2.11-0.20.0.jar, and every time I have to package my application with adam_2.11-0.20.0.jar as a fat jar to submit to Spark. For example, my fat jar is myApp1-adam_2.11-0.20.0.jar, and it is fine to submit as follows: spark-submit --class com.ano.adam.AnnoSp myApp1-adam_2.11-0.20.0.jar However, it reported Exception in thread "main" java.lang.NoClassDefFoundError: org/bdgenomics/adam/rdd when using --jars: spark-submit --class com.ano.adam.AnnoSp myApp1.jar --jars adam_2.11-0.20.0
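One frequent cause of exactly this NoClassDefFoundError, offered here as an assumption rather than the thread's accepted answer: spark-submit treats everything after the application jar as program arguments, so --jars has to appear before myApp1.jar. A minimal sketch using the names from the question:

    import subprocess

    # Options such as --jars must precede the application jar; anything placed
    # after myApp1.jar is passed to the application, not to spark-submit.
    subprocess.run([
        "spark-submit",
        "--class", "com.ano.adam.AnnoSp",
        "--jars", "adam_2.11-0.20.0.jar",   # dependency shipped to driver and executors
        "myApp1.jar",
    ])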

Copy files (config) from HDFS to local working directory of every Spark executor

Submitted by 坚强是说给别人听的谎言 on 2019-12-19 11:25:29
Question: I am looking for a way to copy a folder of resource dependency files from HDFS to the local working directory of each Spark executor using Java. At first I was thinking of using the --files FILES option of spark-submit, but it seems it does not support folders with arbitrary nesting. So it appears I have to do it by putting this folder on a shared HDFS path, to be copied correctly by each executor to its working directory before running a job, but I have yet to find out how to do it correctly in
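A hedged sketch of one option, shown in PySpark rather than the Java the question asks for, with placeholder paths and file names: SparkContext.addFile accepts a directory when recursive=True, and SparkFiles.get resolves the local copy inside each executor.

    from pyspark import SparkContext, SparkFiles

    sc = SparkContext(appName="distribute-config")
    # Ship the whole HDFS folder to every executor's working directory.
    sc.addFile("hdfs:///apps/myjob/config", recursive=True)

    def read_config(_):
        local_dir = SparkFiles.get("config")            # local path on this executor
        with open(local_dir + "/app.properties") as f:  # placeholder file name
            return [f.read()]

    print(sc.parallelize([0], 1).flatMap(read_config).collect())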

How to save a file on the cluster

Submitted by 。_饼干妹妹 on 2019-12-18 07:39:28
Question: I'm connected to the cluster using ssh and I send the program to the cluster using spark-submit --master yarn myProgram.py I want to save the result in a text file, and I tried using the following lines: counts.write.json("hdfs://home/myDir/text_file.txt") counts.write.csv("hdfs://home/myDir/text_file.csv") However, none of them work. The program finishes and I cannot find the text file in myDir. Do you have any idea how I can do this? Also, is there a way to write directly to my local
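As a hedged sketch of the usual causes (user and directory names below are placeholders): the hdfs:// URI needs either the namenode authority or the hdfs:/// shorthand ("home" in hdfs://home/... is parsed as a hostname), and Spark writes a directory of part files rather than a single text file.

    counts.write.mode("overwrite").json("hdfs:///user/myUser/myDir/counts_json")
    counts.write.mode("overwrite").csv("hdfs:///user/myUser/myDir/counts_csv")

    # To get a single local file afterwards, merge the part files on the edge node:
    #   hdfs dfs -getmerge /user/myUser/myDir/counts_csv ./counts.csv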

Failed to submit local jar to spark cluster: java.nio.file.NoSuchFileException

Submitted by 时光总嘲笑我的痴心妄想 on 2019-12-13 13:53:08
Question: ~/spark/spark-2.1.1-bin-hadoop2.7/bin$ ./spark-submit --master spark://192.168.42.80:32141 --deploy-mode cluster file:///home/me/workspace/myproj/target/scala-2.11/myproj-assembly-0.1.0.jar Running Spark using the REST application submission protocol. Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties 17/06/20 16:41:30 INFO RestSubmissionClient: Submitting a request to launch an application in spark://192.168.42.80:32141. 17/06/20 16:41:31 INFO
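The log is cut off above. As an assumption about the usual cause, in cluster deploy mode the driver is launched on a worker node, so a file:// path that exists only on the submitting machine cannot be found there. Two common workarounds, sketched with placeholder host/port values: stage the assembly on storage every node can reach (e.g. HDFS), or submit in client mode.

    import subprocess

    # Option 1: put the jar where all nodes can read it, then submit that URI.
    #   hdfs dfs -put myproj-assembly-0.1.0.jar /jars/
    subprocess.run([
        "spark-submit",
        "--master", "spark://192.168.42.80:7077",   # placeholder legacy submission port
        "--deploy-mode", "cluster",
        "hdfs:///jars/myproj-assembly-0.1.0.jar",
    ])

    # Option 2: keep the local path but run the driver locally (client mode).
    subprocess.run([
        "spark-submit",
        "--master", "spark://192.168.42.80:7077",
        "--deploy-mode", "client",
        "/home/me/workspace/myproj/target/scala-2.11/myproj-assembly-0.1.0.jar",
    ])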

spark-submit error: ClassNotFoundException

Submitted by 让人想犯罪 __ on 2019-12-12 06:48:21
Question:

build.sbt:
lazy val commonSettings = Seq(
  organization := "com.me",
  version := "0.1.0",
  scalaVersion := "2.11.0"
)
lazy val counter = (project in file("counter")).settings(commonSettings: _*)

counter/build.sbt:
name := "counter"
mainClass := Some("Counter")
scalaVersion := "2.11.0"
val sparkVersion = "2.1.1";
libraryDependencies += "org.apache.spark" %% "spark-core" % sparkVersion % "provided";
libraryDependencies += "org.apache.spark" %% "spark-sql" % sparkVersion % "provided";
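The question is truncated here, so as a hedged guess at what usually goes wrong with a multi-project build like this one: the jar that gets submitted must be the one built from the counter sub-project, with --class pointing at its declared main class. A sketch (the jar path and the use of sbt-assembly are assumptions):

    import subprocess

    # Build the sub-project; sbt-assembly is assumed for a fat jar, though
    # `counter/package` can suffice since spark-core/spark-sql are "provided".
    subprocess.run(["sbt", "counter/assembly"])

    subprocess.run([
        "spark-submit",
        "--class", "Counter",   # mainClass declared in counter/build.sbt
        "counter/target/scala-2.11/counter-assembly-0.1.0.jar",  # assumed output path
    ])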

My Spark SQL limit is very slow

Submitted by 有些话、适合烂在心里 on 2019-12-08 21:36:30
Question: I use Spark to read from Elasticsearch, like select col from index limit 10; The problem is that the index is very large; it contains 100 billion rows, and Spark generates thousands of tasks to finish the job. All I need is 10 rows; even one task returning 10 rows could finish the job. I don't need so many tasks. Limit is very slow, even limit 1. Code: sql = select col from index limit 10 sqlExecListener.sparkSession.sql(sql).createOrReplaceTempView(tempTable) Answer 1: The source code of limit shows
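The answer above is cut off, so here is one hedged workaround sketch (names are placeholders, and behaviour can differ across Spark versions): take(n) fetches rows through Spark's incremental partition scan, trying a single partition first, instead of running a LIMIT stage over every Elasticsearch shard split.

    df = sqlExecListener.sparkSession.sql("select col from index")
    first_rows = df.take(10)   # incremental scan rather than a full LIMIT job
    for row in first_rows:
        print(row["col"])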

Spark YARN mode: how to get applicationId from spark-submit

Submitted by 那年仲夏 on 2019-12-08 02:04:25
Question: When I submit a Spark job using spark-submit with master yarn and deploy-mode cluster, it doesn't print/return any applicationId, and once the job is completed I have to manually check the MapReduce jobHistory or the Spark HistoryServer to get the job details. My cluster is used by many users and it takes a lot of time to spot my job in jobHistory/HistoryServer. Is there any way to configure spark-submit to return the applicationId? Note: I found many similar questions but their solutions retrieve
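The note above is truncated. One hedged approach from the submitting side (the regex and file names are assumptions, not the thread's accepted answer): in yarn cluster mode spark-submit logs a line containing the YARN application id, so a wrapper script can capture and parse it.

    import re
    import subprocess

    proc = subprocess.run(
        ["spark-submit", "--master", "yarn", "--deploy-mode", "cluster", "my_job.py"],
        capture_output=True, text=True,
    )
    match = re.search(r"application_\d+_\d+", proc.stderr + proc.stdout)
    print("applicationId:", match.group(0) if match else "not found")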