spark-submit

Add a Python external library in PySpark

Submitted by 强颜欢笑 on 2019-12-24 03:39:27
Question: I'm using PySpark (1.6) and I want to use the databricks:spark-csv library. I've tried different ways with no success. 1- I tried to add a jar I downloaded from https://spark-packages.org/package/databricks/spark-csv and ran pyspark --jars THE_NAME_OF_THE_JAR, then df = sqlContext.read.format('com.databricks:spark-csv').options(header='true', inferschema='true').load('/dlk/doaat/nsi_dev/utilisateur/referentiel/refecart.csv') But got this error: Traceback (most recent call last): File "
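The traceback above is cut off, so as a hedged sketch of one common fix (not necessarily the thread's answer): let Spark resolve the package itself with --packages, and use the dotted format name 'com.databricks.spark.csv' rather than the Maven coordinate 'com.databricks:spark-csv'. The package version and Scala suffix below are assumptions chosen to match a Spark 1.6 build.

    # Launch the shell so Spark downloads the dependency itself (version assumed):
    #   pyspark --packages com.databricks:spark-csv_2.10:1.5.0
    # Then read the CSV with the dotted format name, not the Maven coordinate.
    df = (sqlContext.read
          .format('com.databricks.spark.csv')
          .options(header='true', inferschema='true')
          .load('/dlk/doaat/nsi_dev/utilisateur/referentiel/refecart.csv'))
    df.printSchema()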

List of spark-submit options

Submitted by 时间秒杀一切 on 2019-12-23 01:43:33
Question: There are a ton of tunable settings mentioned on the Spark configurations page. However, as noted here, the SparkSubmitOptionParser attribute name for a Spark property can differ from that property's name. For instance, spark.executor.cores is passed as --executor-cores in spark-submit. Where can I find an exhaustive list of all tuning parameters of Spark (along with their SparkSubmitOptionParser property names) that can be passed with the spark-submit command? Answer 1: While @suj1th 's valuable
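The answer is truncated above, so here is a small hedged illustration of the general rule rather than an exhaustive list: only a handful of properties have dedicated spark-submit flags (spark-submit --help prints them), and every other property is passed through --conf key=value.

    # A few well-known property-name -> spark-submit flag pairs; anything without
    # a dedicated flag is passed as: spark-submit --conf spark.some.property=value
    flag_for_property = {
        "spark.master":          "--master",
        "spark.driver.memory":   "--driver-memory",
        "spark.executor.memory": "--executor-memory",
        "spark.executor.cores":  "--executor-cores",
    }
    for prop, flag in flag_for_property.items():
        print(f"{prop:24} -> {flag}")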

AWS EMR using Spark steps in cluster mode. Application application_ finished with failed status

Submitted by 跟風遠走 on 2019-12-20 05:12:20
Question: I'm trying to launch a cluster using the AWS CLI. I use the following command: aws emr create-cluster --name "Config1" --release-label emr-5.0.0 --applications Name=Spark --use-default-role --log-uri 's3://aws-logs-813591802533-us-west-2/elasticmapreduce/' --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m1.medium InstanceGroupType=CORE,InstanceCount=2,InstanceType=m1.medium The cluster is created successfully. Then I add this command: aws emr add-steps --cluster-id ID
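The add-steps command above is cut off. As a hedged sketch (cluster id, class name, and S3 paths are placeholders), the same Spark step can be expressed with boto3, running spark-submit in cluster deploy mode through command-runner.jar:

    import boto3

    emr = boto3.client("emr", region_name="us-west-2")
    emr.add_job_flow_steps(
        JobFlowId="j-XXXXXXXXXXXXX",          # placeholder cluster id
        Steps=[{
            "Name": "Spark step",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "spark-submit", "--deploy-mode", "cluster",
                    "--class", "com.example.Main",      # placeholder main class
                    "s3://my-bucket/jars/my-app.jar",   # placeholder application jar
                ],
            },
        }],
    )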

How to append a resource jar for spark-submit?

Submitted by ╄→尐↘猪︶ㄣ on 2019-12-20 04:59:34
Question: My Spark application depends on adam_2.11-0.20.0.jar, and every time I have to package my application with adam_2.11-0.20.0.jar as a fat jar to submit to Spark. For example, my fat jar is myApp1-adam_2.11-0.20.0.jar, and it is fine to submit as follows: spark-submit --class com.ano.adam.AnnoSp myApp1-adam_2.11-0.20.0.jar However, it reported Exception in thread "main" java.lang.NoClassDefFoundError: org/bdgenomics/adam/rdd when using --jars: spark-submit --class com.ano.adam.AnnoSp myApp1.jar --jars adam_2.11-0.20.0
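One frequent cause of exactly this NoClassDefFoundError, offered here as an assumption rather than the thread's accepted answer: spark-submit treats everything after the application jar as program arguments, so --jars has to appear before myApp1.jar. A minimal sketch using the names from the question:

    import subprocess

    # Options such as --jars must precede the application jar; anything placed
    # after myApp1.jar is passed to the application, not to spark-submit.
    subprocess.run([
        "spark-submit",
        "--class", "com.ano.adam.AnnoSp",
        "--jars", "adam_2.11-0.20.0.jar",   # dependency shipped to driver and executors
        "myApp1.jar",
    ])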

Copy files (config) from HDFS to local working directory of every Spark executor

Submitted by 坚强是说给别人听的谎言 on 2019-12-19 11:25:29
Question: I am looking for a way to copy a folder of resource dependency files from HDFS to the local working directory of each Spark executor using Java. At first I was thinking of using the --files FILES option of spark-submit, but it seems it does not support folders with arbitrary nesting. So it appears I have to do it by putting this folder on a shared HDFS path, to be copied correctly by each executor to its working directory before running a job, but I have yet to find out how to do it correctly in
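A hedged sketch of one option, shown in PySpark rather than the Java the question asks for, with placeholder paths and file names: SparkContext.addFile accepts a directory when recursive=True, and SparkFiles.get resolves the local copy inside each executor.

    from pyspark import SparkContext, SparkFiles

    sc = SparkContext(appName="distribute-config")
    # Ship the whole HDFS folder to every executor's working directory.
    sc.addFile("hdfs:///apps/myjob/config", recursive=True)

    def read_config(_):
        local_dir = SparkFiles.get("config")            # local path on this executor
        with open(local_dir + "/app.properties") as f:  # placeholder file name
            return [f.read()]

    print(sc.parallelize([0], 1).flatMap(read_config).collect())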

How to save a file on the cluster

Submitted by 。_饼干妹妹 on 2019-12-18 07:39:28
Question: I'm connected to the cluster using ssh and I send the program to the cluster using spark-submit --master yarn myProgram.py I want to save the result in a text file, and I tried using the following lines: counts.write.json("hdfs://home/myDir/text_file.txt") counts.write.csv("hdfs://home/myDir/text_file.csv") However, none of them work. The program finishes and I cannot find the text file in myDir. Do you have any idea how I can do this? Also, is there a way to write directly to my local
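As a hedged sketch of the usual causes (user and directory names below are placeholders): the hdfs:// URI needs either the namenode authority or the hdfs:/// shorthand ("home" in hdfs://home/... is parsed as a hostname), and Spark writes a directory of part files rather than a single text file.

    counts.write.mode("overwrite").json("hdfs:///user/myUser/myDir/counts_json")
    counts.write.mode("overwrite").csv("hdfs:///user/myUser/myDir/counts_csv")

    # To get a single local file afterwards, merge the part files on the edge node:
    #   hdfs dfs -getmerge /user/myUser/myDir/counts_csv ./counts.csv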

Failed to submit local jar to spark cluster: java.nio.file.NoSuchFileException

Submitted by 时光总嘲笑我的痴心妄想 on 2019-12-13 13:53:08
Question: ~/spark/spark-2.1.1-bin-hadoop2.7/bin$ ./spark-submit --master spark://192.168.42.80:32141 --deploy-mode cluster file:///home/me/workspace/myproj/target/scala-2.11/myproj-assembly-0.1.0.jar Running Spark using the REST application submission protocol. Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties 17/06/20 16:41:30 INFO RestSubmissionClient: Submitting a request to launch an application in spark://192.168.42.80:32141. 17/06/20 16:41:31 INFO
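The log is cut off above. As an assumption about the usual cause, in cluster deploy mode the driver is launched on a worker node, so a file:// path that exists only on the submitting machine cannot be found there. Two common workarounds, sketched with placeholder host/port values: stage the assembly on storage every node can reach (e.g. HDFS), or submit in client mode.

    import subprocess

    # Option 1: put the jar where all nodes can read it, then submit that URI.
    #   hdfs dfs -put myproj-assembly-0.1.0.jar /jars/
    subprocess.run([
        "spark-submit",
        "--master", "spark://192.168.42.80:7077",   # placeholder legacy submission port
        "--deploy-mode", "cluster",
        "hdfs:///jars/myproj-assembly-0.1.0.jar",
    ])

    # Option 2: keep the local path but run the driver locally (client mode).
    subprocess.run([
        "spark-submit",
        "--master", "spark://192.168.42.80:7077",
        "--deploy-mode", "client",
        "/home/me/workspace/myproj/target/scala-2.11/myproj-assembly-0.1.0.jar",
    ])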

spark-submit error: ClassNotFoundException

Submitted by 让人想犯罪 __ on 2019-12-12 06:48:21
Question:

build.sbt:
lazy val commonSettings = Seq(
  organization := "com.me",
  version := "0.1.0",
  scalaVersion := "2.11.0"
)
lazy val counter = (project in file("counter")).settings(commonSettings: _*)

counter/build.sbt:
name := "counter"
mainClass := Some("Counter")
scalaVersion := "2.11.0"
val sparkVersion = "2.1.1";
libraryDependencies += "org.apache.spark" %% "spark-core" % sparkVersion % "provided";
libraryDependencies += "org.apache.spark" %% "spark-sql" % sparkVersion % "provided";
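The question is truncated here, so as a hedged guess at what usually goes wrong with a multi-project build like this one: the jar that gets submitted must be the one built from the counter sub-project, with --class pointing at its declared main class. A sketch (the jar path and the use of sbt-assembly are assumptions):

    import subprocess

    # Build the sub-project; sbt-assembly is assumed for a fat jar, though
    # `counter/package` can suffice since spark-core/spark-sql are "provided".
    subprocess.run(["sbt", "counter/assembly"])

    subprocess.run([
        "spark-submit",
        "--class", "Counter",   # mainClass declared in counter/build.sbt
        "counter/target/scala-2.11/counter-assembly-0.1.0.jar",  # assumed output path
    ])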

My Spark SQL limit is very slow

Submitted by 有些话、适合烂在心里 on 2019-12-08 21:36:30
Question: I use Spark to read from Elasticsearch, like select col from index limit 10; The problem is that the index is very large; it contains 100 billion rows, and Spark generates thousands of tasks to finish the job. All I need is 10 rows; even one task returning 10 rows could finish the job. I don't need so many tasks. Limit is very slow, even limit 1. Code: sql = select col from index limit 10 sqlExecListener.sparkSession.sql(sql).createOrReplaceTempView(tempTable) Answer 1: The source code of limit shows
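The answer above is cut off, so here is one hedged workaround sketch (names are placeholders, and behaviour can differ across Spark versions): take(n) fetches rows through Spark's incremental partition scan, trying a single partition first, instead of running a LIMIT stage over every Elasticsearch shard split.

    df = sqlExecListener.sparkSession.sql("select col from index")
    first_rows = df.take(10)   # incremental scan rather than a full LIMIT job
    for row in first_rows:
        print(row["col"])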

Spark YARN mode: how to get applicationId from spark-submit

Submitted by 那年仲夏 on 2019-12-08 02:04:25
Question: When I submit a Spark job using spark-submit with master yarn and deploy-mode cluster, it doesn't print/return any applicationId, and once the job is completed I have to manually check the MapReduce jobHistory or the Spark HistoryServer to get the job details. My cluster is used by many users and it takes a lot of time to spot my job in jobHistory/HistoryServer. Is there any way to configure spark-submit to return the applicationId? Note: I found many similar questions but their solutions retrieve
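The note above is truncated. One hedged approach from the submitting side (the regex and file names are assumptions, not the thread's accepted answer): in yarn cluster mode spark-submit logs a line containing the YARN application id, so a wrapper script can capture and parse it.

    import re
    import subprocess

    proc = subprocess.run(
        ["spark-submit", "--master", "yarn", "--deploy-mode", "cluster", "my_job.py"],
        capture_output=True, text=True,
    )
    match = re.search(r"application_\d+_\d+", proc.stderr + proc.stdout)
    print("applicationId:", match.group(0) if match else "not found")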