apache-spark

Spark Streaming Exception: java.util.NoSuchElementException: None.get

我怕爱的太早我们不能终老 submitted on 2021-01-27 06:33:10
Question: I am writing Spark Streaming data to HDFS by converting it to a DataFrame:

    object KafkaSparkHdfs {
      val sparkConf = new SparkConf().setMaster("local[*]").setAppName("SparkKafka")
      sparkConf.set("spark.driver.allowMultipleContexts", "true");
      val sc = new SparkContext(sparkConf)

      def main(args: Array[String]): Unit = {
        val sqlContext = new org.apache.spark.sql.SQLContext(sc)
        import sqlContext.implicits._
        val ssc = new StreamingContext(sparkConf, Seconds(20))
        val kafkaParams = Map[String,
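
The None.get here typically comes from running two contexts side by side: a SparkContext created eagerly from sparkConf plus a StreamingContext created from the same conf, papered over with spark.driver.allowMultipleContexts. A minimal sketch of the commonly suggested structure, building the StreamingContext from the one existing SparkContext so only a single context exists (the object name and the omitted Kafka wiring are illustrative, not from the original post):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object KafkaSparkHdfsSketch {
      def main(args: Array[String]): Unit = {
        val sparkConf = new SparkConf().setMaster("local[*]").setAppName("SparkKafka")
        val sc = new SparkContext(sparkConf)             // the only SparkContext
        val sqlContext = new SQLContext(sc)              // reuse it for DataFrame conversion
        val ssc = new StreamingContext(sc, Seconds(20))  // reuse it for streaming -- no second context

        // ... create the Kafka direct stream here and write each micro-batch to HDFS ...

        ssc.start()
        ssc.awaitTermination()
      }
    }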

How to resolve: very large task sizes in Spark

强颜欢笑 submitted on 2021-01-27 06:23:09
Question: Here I am pasting the Python code that I run on Spark to perform some analysis on data. I can run the program on a small data set, but with a large data set it says: "Stage 1 contains a task of very large size (17693 KB). The maximum recommended task size is 100 KB".

    import os
    import sys
    import unicodedata
    from operator import add
    try:
        from pyspark import SparkConf
        from pyspark import SparkContext
    except ImportError as e:
        print ("Error
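
This warning usually means the serialized task closure is carrying a large driver-side object, for example a big lookup structure referenced inside a map, or a large local list handed to parallelize. The common remedy is to broadcast the read-only data once rather than shipping it with every task. A hedged sketch of the pattern, shown here in Scala (PySpark has the equivalent SparkContext.broadcast); the lookup map and input path are illustrative:

    import org.apache.spark.{SparkConf, SparkContext}

    object BroadcastSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("broadcast-sketch").setMaster("local[*]"))

        // A large read-only structure built on the driver.
        val lookup: Map[String, Int] = (1 to 1000000).map(i => s"key$i" -> i).toMap

        // Referencing `lookup` directly in a closure ships it with every task;
        // broadcasting it sends one copy per executor instead.
        val lookupBc = sc.broadcast(lookup)

        val total = sc.textFile("hdfs:///some/input")          // illustrative path
          .map(line => lookupBc.value.getOrElse(line, 0))
          .reduce(_ + _)

        println(total)
        sc.stop()
      }
    }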

How can I see the SQL statements that Spark sends to my database?

末鹿安然 submitted on 2021-01-27 06:16:46
Question: I have a Spark cluster and a Vertica database. I use spark.read.jdbc( # etc to load Spark DataFrames into the cluster. When I run a certain groupby,

    df2 = df.groupby('factor').agg(F.stddev('sum(PnL)'))
    df2.show()

I then get a Vertica syntax exception:

    Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1454)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler
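
Two places commonly used to see what Spark actually sends over JDBC: the physical plan, which names the JDBC relation and lists the filters Spark pushes into its generated SELECT, and the database's own query log on the Vertica side. A minimal sketch in Scala (URL, table, and credentials are placeholders, not from the original post):

    import org.apache.spark.sql.SparkSession

    object ShowJdbcSql {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("show-jdbc-sql").master("local[*]").getOrCreate()

        val df = spark.read
          .format("jdbc")
          .option("url", "jdbc:vertica://host:5433/db")   // placeholder connection details
          .option("dbtable", "some_table")
          .option("user", "user")
          .option("password", "password")
          .load()

        // The extended plan shows the JDBCRelation plus "PushedFilters", i.e. the
        // predicates Spark will embed in the SELECT it issues against the database.
        df.filter("factor = 'A'").explain(true)
      }
    }

For the exact statement text as the database received it, the database's own query history (on Vertica, a system table such as query_requests, assuming that monitoring is enabled) is usually the most direct source.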

Creating a Java Object in Scala

最后都变了- submitted on 2021-01-27 05:51:46
Question: I have a Java class "Listings". I use it in my Java MapReduce job as below:

    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        Listings le = new Listings(value.toString());
        ...
    }

I want to run the same job on Spark, so I am now writing it in Scala. I imported the Java class:

    import src.main.java.lists.Listings

I want to create a Listings object in Scala. I am doing this:

    val file_le = sc.textFile("file// Path to file")
    Listings lists = new
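
The last line is Java declaration syntax; Scala puts the (optional) type annotation after the variable name, or infers it. A small sketch of the idiomatic forms (the constructor argument and the path are placeholders, and the class would normally be imported by its declared package rather than its source-directory path):

    // Java: Listings lists = new Listings(line);
    // Scala equivalents:
    val lists = new Listings(line)              // type inferred
    val lists2: Listings = new Listings(line)   // explicit annotation

    // Applied per record of the text file:
    val file_le = sc.textFile("hdfs:///path/to/file")       // illustrative path
    val listings = file_le.map(line => new Listings(line))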

How does Spark SQL read compressed csv files?

早过忘川 submitted on 2021-01-27 05:43:11
Question: I have tried the spark.read.csv API to read compressed CSV files with the bz or gzip extension, and it worked. But in the source code I cannot find any option parameter for declaring the codec type; even in this link there is only a codec setting on the writing side. Could anyone tell me, or point me to the source code that shows, how Spark 2.x deals with compressed CSV files?

Answer 1: All text-related data sources, including CSVDataSource, use the Hadoop File API to deal with files (it was in
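
In practice the codec is inferred from the file extension by Hadoop's compression codec machinery, so reading needs no extra option; only writing takes one. A minimal sketch (paths are placeholders):

    import org.apache.spark.sql.SparkSession

    object CompressedCsvSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("compressed-csv").master("local[*]").getOrCreate()

        // Reading: no codec option -- .gz / .bz2 files are recognised by extension.
        val df = spark.read.option("header", "true").csv("hdfs:///data/input.csv.gz")

        // Writing: compression is an explicit option on the writer side.
        df.write.option("compression", "gzip").csv("hdfs:///data/output_gzipped")
      }
    }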

Spark could not bind on port 7077 with public IP

爱⌒轻易说出口 submitted on 2021-01-27 05:41:47
Question: I have installed Spark on AWS. The instance itself works, but Spark does not start, and when I check the Spark master log I see the following:

    Spark Command: /usr/lib/jvm/java-8-oracle/jre/bin/java -cp /home/ubuntu/spark/conf/:/home/ubuntu/spark/jars/* -Xmx1g org.apache.spark$
    ========================================
    Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
    16/09/12 09:40:18 INFO Master: Started daemon with process name: 5451@server1
    16/09/12 09:40:18
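
The usual explanation for a bind failure on 7077 with a public address is that an EC2 public IP is NAT'd and not assigned to any local network interface, so the master cannot bind to it. The commonly suggested fix is to bind to the instance's private IP (or to all interfaces) and reach the public address only from outside, through the security group. A hedged sketch of the relevant lines in conf/spark-env.sh (the IP is illustrative; on older 1.x releases the variable is SPARK_MASTER_IP):

    # conf/spark-env.sh -- illustrative values, not from the original post
    # Bind to an address that actually exists on the instance: the private IP,
    # since the EC2 public IP is NAT'd and not attached to any local interface.
    SPARK_MASTER_HOST=172.31.0.10
    # or bind on all interfaces:
    # SPARK_MASTER_HOST=0.0.0.0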

Error creating transactional connection factory when running a Spark-on-Hive project in IDEA

一笑奈何 submitted on 2021-01-27 04:51:34
Question: I am trying to set up a development environment for a Spark Streaming project that needs to write data into Hive. I have a cluster with 1 master, 2 slaves, and 1 development machine (coding in IntelliJ IDEA 14). Within the spark shell everything seems to work fine, and I can store data into the default Hive database via Spark 1.5 using DataFrame.write.insertInto("testtable"). However, when I create a Scala project in IDEA and run it against the same cluster with the same settings, an error is thrown when
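
"Error creating transactional connection factory" generally means the Hive metastore client inside the application could not be set up, and the usual difference from spark-shell is that the IDE run does not see hive-site.xml (or the spark-hive/DataNucleus dependencies) on its classpath. A hedged sketch of a Spark 1.5-style setup, assuming hive-site.xml from the cluster is copied into the project's resources and spark-hive is declared as a build dependency (the query and table name are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    object HiveFromIdeaSketch {
      def main(args: Array[String]): Unit = {
        // hive-site.xml should be on the runtime classpath (e.g. src/main/resources)
        // so the remote metastore is used instead of a local embedded Derby one.
        val sc = new SparkContext(new SparkConf().setAppName("hive-from-idea"))
        val hiveContext = new HiveContext(sc)

        val df = hiveContext.sql("SELECT 1 AS x")   // illustrative DataFrame
        df.write.insertInto("testtable")
      }
    }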
