apache-spark

Spark Streaming Exception: java.util.NoSuchElementException: None.get

我怕爱的太早我们不能终老 submitted on 2021-01-27 06:33:10
Question: I am writing Spark Streaming data to HDFS by converting it to a DataFrame:

    object KafkaSparkHdfs {
      val sparkConf = new SparkConf().setMaster("local[*]").setAppName("SparkKafka")
      sparkConf.set("spark.driver.allowMultipleContexts", "true");
      val sc = new SparkContext(sparkConf)

      def main(args: Array[String]): Unit = {
        val sqlContext = new org.apache.spark.sql.SQLContext(sc)
        import sqlContext.implicits._
        val ssc = new StreamingContext(sparkConf, Seconds(20))
        val kafkaParams = Map[String,
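
The None.get here typically comes from running two contexts side by side: a SparkContext created eagerly from sparkConf plus a StreamingContext created from the same conf, papered over with spark.driver.allowMultipleContexts. A minimal sketch of the commonly suggested structure, building the StreamingContext from the one existing SparkContext so only a single context exists (the object name and the omitted Kafka wiring are illustrative, not from the original post):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object KafkaSparkHdfsSketch {
      def main(args: Array[String]): Unit = {
        val sparkConf = new SparkConf().setMaster("local[*]").setAppName("SparkKafka")
        val sc = new SparkContext(sparkConf)             // the only SparkContext
        val sqlContext = new SQLContext(sc)              // reuse it for DataFrame conversion
        val ssc = new StreamingContext(sc, Seconds(20))  // reuse it for streaming -- no second context

        // ... create the Kafka direct stream here and write each micro-batch to HDFS ...

        ssc.start()
        ssc.awaitTermination()
      }
    }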

How to resolve: very large task sizes in Spark

强颜欢笑 submitted on 2021-01-27 06:23:09
Question: Here I am pasting the Python code that I run on Spark to perform some analysis on data. I can run the program on a small data set, but with a large data set it says: "Stage 1 contains a task of very large size (17693 KB). The maximum recommended task size is 100 KB".

    import os
    import sys
    import unicodedata
    from operator import add
    try:
        from pyspark import SparkConf
        from pyspark import SparkContext
    except ImportError as e:
        print ("Error
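
This warning usually means the serialized task closure is carrying a large driver-side object, for example a big lookup structure referenced inside a map, or a large local list handed to parallelize. The common remedy is to broadcast the read-only data once rather than shipping it with every task. A hedged sketch of the pattern, shown here in Scala (PySpark has the equivalent SparkContext.broadcast); the lookup map and input path are illustrative:

    import org.apache.spark.{SparkConf, SparkContext}

    object BroadcastSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("broadcast-sketch").setMaster("local[*]"))

        // A large read-only structure built on the driver.
        val lookup: Map[String, Int] = (1 to 1000000).map(i => s"key$i" -> i).toMap

        // Referencing `lookup` directly in a closure ships it with every task;
        // broadcasting it sends one copy per executor instead.
        val lookupBc = sc.broadcast(lookup)

        val total = sc.textFile("hdfs:///some/input")          // illustrative path
          .map(line => lookupBc.value.getOrElse(line, 0))
          .reduce(_ + _)

        println(total)
        sc.stop()
      }
    }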

How can I see the SQL statements that Spark sends to my database?

末鹿安然 submitted on 2021-01-27 06:16:46
Question: I have a Spark cluster and a Vertica database. I use spark.read.jdbc( # etc to load Spark DataFrames into the cluster. When I run a certain groupby,

    df2 = df.groupby('factor').agg(F.stddev('sum(PnL)'))
    df2.show()

I then get a Vertica syntax exception:

    Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1454)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler
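
Two places commonly used to see what Spark actually sends over JDBC: the physical plan, which names the JDBC relation and lists the filters Spark pushes into its generated SELECT, and the database's own query log on the Vertica side. A minimal sketch in Scala (URL, table, and credentials are placeholders, not from the original post):

    import org.apache.spark.sql.SparkSession

    object ShowJdbcSql {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("show-jdbc-sql").master("local[*]").getOrCreate()

        val df = spark.read
          .format("jdbc")
          .option("url", "jdbc:vertica://host:5433/db")   // placeholder connection details
          .option("dbtable", "some_table")
          .option("user", "user")
          .option("password", "password")
          .load()

        // The extended plan shows the JDBCRelation plus "PushedFilters", i.e. the
        // predicates Spark will embed in the SELECT it issues against the database.
        df.filter("factor = 'A'").explain(true)
      }
    }

For the exact statement text as the database received it, the database's own query history (on Vertica, a system table such as query_requests, assuming that monitoring is enabled) is usually the most direct source.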

Creating a Java Object in Scala

最后都变了- submitted on 2021-01-27 05:51:46
Question: I have a Java class "Listings". I use it in my Java MapReduce job as below:

    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        Listings le = new Listings(value.toString());
        ...
    }

I want to run the same job on Spark, so I am now writing it in Scala. I imported the Java class:

    import src.main.java.lists.Listings

I want to create a Listings object in Scala. I am doing this:

    val file_le = sc.textFile("file// Path to file")
    Listings lists = new
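
The last line is Java declaration syntax; Scala puts the (optional) type annotation after the variable name, or infers it. A small sketch of the idiomatic forms (the constructor argument and the path are placeholders, and the class would normally be imported by its declared package rather than its source-directory path):

    // Java: Listings lists = new Listings(line);
    // Scala equivalents:
    val lists = new Listings(line)              // type inferred
    val lists2: Listings = new Listings(line)   // explicit annotation

    // Applied per record of the text file:
    val file_le = sc.textFile("hdfs:///path/to/file")       // illustrative path
    val listings = file_le.map(line => new Listings(line))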

How does Spark SQL read compressed csv files?

早过忘川 submitted on 2021-01-27 05:43:11
Question: I have tried the spark.read.csv API to read compressed CSV files with the bz or gzip extension, and it worked. But in the source code I cannot find any option parameter for declaring the codec type; even in this link there is only a codec setting on the writing side. Could anyone tell me, or point me to the source code that shows, how Spark 2.x deals with compressed CSV files?

Answer 1: All text-related data sources, including CSVDataSource, use the Hadoop File API to deal with files (it was in
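
In practice the codec is inferred from the file extension by Hadoop's compression codec machinery, so reading needs no extra option; only writing takes one. A minimal sketch (paths are placeholders):

    import org.apache.spark.sql.SparkSession

    object CompressedCsvSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("compressed-csv").master("local[*]").getOrCreate()

        // Reading: no codec option -- .gz / .bz2 files are recognised by extension.
        val df = spark.read.option("header", "true").csv("hdfs:///data/input.csv.gz")

        // Writing: compression is an explicit option on the writer side.
        df.write.option("compression", "gzip").csv("hdfs:///data/output_gzipped")
      }
    }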

Spark could not bind on port 7077 with public IP

爱⌒轻易说出口 submitted on 2021-01-27 05:41:47
Question: I have installed Spark on AWS. The instance itself works, but Spark does not start, and when I check the Spark master log I see the following:

    Spark Command: /usr/lib/jvm/java-8-oracle/jre/bin/java -cp /home/ubuntu/spark/conf/:/home/ubuntu/spark/jars/* -Xmx1g org.apache.spark$
    ========================================
    Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
    16/09/12 09:40:18 INFO Master: Started daemon with process name: 5451@server1
    16/09/12 09:40:18
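
The usual explanation for a bind failure on 7077 with a public address is that an EC2 public IP is NAT'd and not assigned to any local network interface, so the master cannot bind to it. The commonly suggested fix is to bind to the instance's private IP (or to all interfaces) and reach the public address only from outside, through the security group. A hedged sketch of the relevant lines in conf/spark-env.sh (the IP is illustrative; on older 1.x releases the variable is SPARK_MASTER_IP):

    # conf/spark-env.sh -- illustrative values, not from the original post
    # Bind to an address that actually exists on the instance: the private IP,
    # since the EC2 public IP is NAT'd and not attached to any local interface.
    SPARK_MASTER_HOST=172.31.0.10
    # or bind on all interfaces:
    # SPARK_MASTER_HOST=0.0.0.0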

Error creating transactional connection factory when running a Spark-on-Hive project in IDEA

一笑奈何 submitted on 2021-01-27 04:51:34
Question: I am trying to set up a development environment for a Spark Streaming project that needs to write data into Hive. I have a cluster with 1 master, 2 slaves, and 1 development machine (coding in IntelliJ IDEA 14). Within the spark shell everything seems to work fine, and I can store data into the default Hive database via Spark 1.5 using DataFrame.write.insertInto("testtable"). However, when I create a Scala project in IDEA and run it against the same cluster with the same settings, an error is thrown when
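
"Error creating transactional connection factory" generally means the Hive metastore client inside the application could not be set up, and the usual difference from spark-shell is that the IDE run does not see hive-site.xml (or the spark-hive/DataNucleus dependencies) on its classpath. A hedged sketch of a Spark 1.5-style setup, assuming hive-site.xml from the cluster is copied into the project's resources and spark-hive is declared as a build dependency (the query and table name are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    object HiveFromIdeaSketch {
      def main(args: Array[String]): Unit = {
        // hive-site.xml should be on the runtime classpath (e.g. src/main/resources)
        // so the remote metastore is used instead of a local embedded Derby one.
        val sc = new SparkContext(new SparkConf().setAppName("hive-from-idea"))
        val hiveContext = new HiveContext(sc)

        val df = hiveContext.sql("SELECT 1 AS x")   // illustrative DataFrame
        df.write.insertInto("testtable")
      }
    }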
