spark-submit

Pass system property to spark-submit and read file from classpath or custom path

Submitted by 限于喜欢 on 2019-12-06 22:58:57
Question: I have recently found a way to use logback instead of log4j in Apache Spark (both for local use and with spark-submit). However, there is one last piece missing. The issue is that Spark tries very hard not to see the logback.xml settings in its classpath. I have already found a way to load it during local execution.

What I have so far

Basically, checking for the system property logback.configurationFile, but loading logback.xml from my /src/main/resources/ just in case:

    // the same as default: https://logback.qos.ch/manual/configuration.html
    private val LogbackLocation = Option(System.getProperty("logback…
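A minimal sketch of the fallback just described, assuming a hypothetical helper (the name openConfig and the classpath resource stream are illustrative, not the asker's code):

    import java.io.{File, FileInputStream, InputStream}

    // Sketch: prefer an explicit -Dlogback.configurationFile=... path,
    // otherwise fall back to the logback.xml bundled under src/main/resources.
    object LogbackConfig {
      private val LogbackLocation: Option[String] =
        Option(System.getProperty("logback.configurationFile"))

      def openConfig(): InputStream =
        LogbackLocation
          .map(path => new FileInputStream(new File(path)): InputStream)
          .getOrElse(getClass.getResourceAsStream("/logback.xml"))
    }

With spark-submit, such a property is usually forwarded along the lines of --conf "spark.driver.extraJavaOptions=-Dlogback.configurationFile=logback.xml" together with --files logback.xml, though the exact flags depend on the deploy mode.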

Spark java.lang.OutOfMemoryError: Java heap space

Submitted by 人走茶凉 on 2019-12-05 10:50:31
Question: I am getting the above error when I run a model training pipeline with Spark:

    val inputData = spark.read
      .option("header", true)
      .option("mode", "DROPMALFORMED")
      .csv(input)
      .repartition(500)
      .toDF("b", "c")
      .withColumn("b", lower(col("b")))
      .withColumn("c", lower(col("c")))
      .toDF("b", "c")
      .na.drop()

inputData has about 25 million rows and is about 2 GB in size. The model building phase happens like so:

    val tokenizer = new Tokenizer()
      .setInputCol("c")
      .setOutputCol("tokens")
    val cvSpec = new CountVectorizer()
      .setInputCol("tokens")
      .setOutputCol("features")
      .setMinDF(minDF)
      .setVocabSize…
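For orientation, the two stages in the excerpt are the kind normally assembled into a single spark.ml Pipeline; the sketch below reuses inputData from above and substitutes placeholder values for minDF and the vocabulary size, since the originals are cut off:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.feature.{CountVectorizer, Tokenizer}

    // Sketch of the model-building stages; minDF and vocabSize are placeholders.
    val tokenizer = new Tokenizer()
      .setInputCol("c")
      .setOutputCol("tokens")

    val cvSpec = new CountVectorizer()
      .setInputCol("tokens")
      .setOutputCol("features")
      .setMinDF(10)           // placeholder
      .setVocabSize(1 << 18)  // placeholder; an oversized vocabulary is a common heap-pressure culprit

    val model = new Pipeline()
      .setStages(Array(tokenizer, cvSpec))
      .fit(inputData)

Heap sizing itself is controlled outside the code, typically through spark-submit's --driver-memory and --executor-memory options.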

SparkException using JavaStreamingContext.getOrCreate(): Only one SparkContext may be running in this JVM

Submitted by 喜你入骨 on 2019-12-02 19:53:11
Question: Related to this question, I got the tip that the getOrCreate idiom should be used to avoid this issue. But trying:

    JavaStreamingContextFactory contextFactory = new JavaStreamingContextFactory() {
        @Override
        public JavaStreamingContext create() {
            final SparkConf conf = new SparkConf().setAppName(NAME);
            return new JavaStreamingContext(conf, Durations.seconds(BATCH_SPAN));
        }
    };
    final JavaStreamingContext context = JavaStreamingContext.getOrCreate("/tmp/" + NAME, contextFactory);

I'm still getting:

    Exception in thread "main" org.apache.spark.SparkException: Only one SparkContext may be running in …
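For comparison, here is a sketch of the same checkpoint-based getOrCreate idiom in Scala (the app name, checkpoint path, and batch interval are placeholders, not the asker's values):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Sketch: the factory is only invoked when no checkpoint data exists at the path.
    val checkpointDir = "/tmp/MyAppName" // placeholder

    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("MyAppName") // placeholder
      val ssc  = new StreamingContext(conf, Seconds(10)) // placeholder batch interval
      ssc.checkpoint(checkpointDir)
      // ... define the streaming computation here before returning ...
      ssc
    }

    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)

Note that getOrCreate only recovers a context from checkpoint data at the given path; it does not attach to a SparkContext that is already alive in the JVM, so any other live context in the application can still trigger the error above.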

How to append a resource jar for spark-submit?

Submitted by 女生的网名这么多〃 on 2019-12-02 04:53:15
My Spark application depends on adam_2.11-0.20.0.jar, so every time I have to package my application with adam_2.11-0.20.0.jar as a fat jar to submit to Spark. For example, my fat jar is myApp1-adam_2.11-0.20.0.jar, and it is OK to submit as follows:

    spark-submit --class com.ano.adam.AnnoSp myApp1-adam_2.11-0.20.0.jar

Using --jars,

    spark-submit --class com.ano.adam.AnnoSp myApp1.jar --jars adam_2.11-0.20.0.jar

it reported

    Exception in thread "main" java.lang.NoClassDefFoundError: org/bdgenomics/adam/rdd

My question is how to submit using 2 separate jars without packaging them together:

    spark-submit - …
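One detail worth checking in the failing command: spark-submit treats everything after the primary application jar as arguments to the application itself, so a --jars flag placed after myApp1.jar never reaches spark-submit. The usual form keeps all options before the application jar, for example (paths illustrative):

    spark-submit --class com.ano.adam.AnnoSp --jars adam_2.11-0.20.0.jar myApp1.jar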

Spark asynchronous job fails with error

Submitted by 自古美人都是妖i on 2019-12-01 14:48:15
I'm writing code for Spark in Java. When I use foreachAsync, Spark fails and gives me java.lang.IllegalStateException: Cannot call methods on a stopped SparkContext. In this code:

    JavaSparkContext sparkContext = new JavaSparkContext("local", "MyAppName");
    JavaPairRDD<String, String> wholeTextFiles = sparkContext.wholeTextFiles("somePath");
    wholeTextFiles.foreach(new VoidFunction<Tuple2<String, String>>() {
        public void call(Tuple2<String, String> stringStringTuple2) throws Exception {
            // do something
        }
    });

It works fine. But in this code:

    JavaSparkContext sparkContext = new JavaSparkContext("local…
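The asynchronous variant returns a FutureAction that has to complete before the context is stopped; below is a minimal Scala sketch of that pattern (the path and app name are placeholders, and this is not the asker's code):

    import scala.concurrent.Await
    import scala.concurrent.duration.Duration

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch: block on the FutureAction returned by foreachAsync, then stop the context.
    // Stopping the context while the job is still running is one way to hit
    // "Cannot call methods on a stopped SparkContext".
    val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("MyAppName"))

    val files = sc.wholeTextFiles("somePath") // placeholder path

    val action = files.foreachAsync { case (path, content) =>
      // do something with each (path, content) pair
      println(s"$path -> ${content.length} chars")
    }

    Await.ready(action, Duration.Inf) // wait for the asynchronous job to finish
    sc.stop()                         // only stop the context afterwards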

Join two dataframes without having a common column (Spark, Scala)

Submitted by 纵然是瞬间 on 2019-12-01 13:49:11
I have two dataframes which have different types of columns, and I need to join them. Please refer to the example below:

    df1 has Customer_name, Customer_phone, Customer_age
    df2 has Order_name, Order_ID

These two dataframes don't have any common column, and the number of rows and the number of columns in the two dataframes also differ. I tried to insert a new dummy column to increase the row_index value as below:

    val dfr = df1.withColumn("row_index", monotonically_increasing_id())

But as I am using Spark 2, the monotonically_increasing_id method is not supported for me. Is there any way to …
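One workaround for this kind of positional join is to attach an explicit, consecutive index through the RDD API and join on it; below is a sketch assuming df1 and df2 from the question (zipWithIndex yields consecutive indices, unlike monotonically_increasing_id):

    import org.apache.spark.sql.{DataFrame, Row}
    import org.apache.spark.sql.types.{LongType, StructField, StructType}

    // Sketch: add a consecutive row_index column to each dataframe via zipWithIndex,
    // then join the two on that index to pair rows by position.
    def withRowIndex(df: DataFrame): DataFrame = {
      val indexed = df.rdd.zipWithIndex.map { case (row, idx) =>
        Row.fromSeq(row.toSeq :+ idx)
      }
      val schema = StructType(df.schema.fields :+ StructField("row_index", LongType, nullable = false))
      df.sparkSession.createDataFrame(indexed, schema)
    }

    val joined = withRowIndex(df1)
      .join(withRowIndex(df2), Seq("row_index"))
      .drop("row_index")

Rows beyond the length of the shorter dataframe are dropped by the inner join; an outer join would keep them with nulls.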
