spark-streaming

TypeError: 'JavaPackage' object is not callable & Spark Streaming's Kafka libraries not found in class path

自闭症网瘾萝莉.ら submitted on 2021-01-29 09:48:01
Question: I use PySpark Streaming to read Kafka data, but it fails:

import os
from pyspark.streaming.kafka import KafkaUtils
from pyspark.streaming import StreamingContext
from pyspark import SparkContext

os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-streaming-kafka-0-8:2.0.2 pyspark-shell'
sc = SparkContext(appName="test")
sc.setLogLevel("WARN")
ssc = StreamingContext(sc, 60)
kafkaStream = KafkaUtils.createStream(ssc, "localhost:2181", "test-id", {'test': 2})
kafkaStream

How to recover from checkpoint when using python spark direct approach?

一笑奈何 submitted on 2021-01-29 07:19:36
Question: After reading the official docs, I tried using checkpointing with getOrCreate in Spark Streaming. Some snippets:

def get_ssc():
    sc = SparkContext("yarn-client")
    ssc = StreamingContext(sc, 10)  # calc every 10s
    ks = KafkaUtils.createDirectStream(
        ssc, ['lucky-track'], {"metadata.broker.list": KAFKA_BROKER})
    process_data(ks)
    ssc.checkpoint(CHECKPOINT_DIR)
    return ssc

if __name__ == '__main__':
    ssc = StreamingContext.getOrCreate(CHECKPOINT_DIR, get_ssc)
    ssc.start()
    ssc.awaitTermination()

The code works fine

add parent column name as prefix to avoid ambiguity

≯℡__Kan透↙ submitted on 2021-01-28 21:59:16
Question: Check the code below. It generates a dataframe with ambiguous column names if duplicate keys are present. How should we modify the code to add the parent column name as a prefix? Another column with JSON data has been added.

scala> val df = Seq(
  (77, "email1", """{"key1":38,"key3":39}""", """{"name":"aaa","age":10}"""),
  (78, "email2", """{"key1":38,"key4":39}""", """{"name":"bbb","age":20}"""),
  (178, "email21", """{"key1":"when string","key4":36, "key6":"test", "key10":false }""", """{"name":"ccc","age":30}"""),
  (179,
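
A minimal sketch of the prefixing idea (not from the question; the helper name and the assumption that each JSON column has already been parsed into a struct, e.g. with from_json and a supplied schema, are mine):

import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StructType

// Flatten one struct column, prefixing every nested field with the parent's
// name (e.g. json1.key1 becomes json1_key1) so duplicate keys stay unambiguous.
def prefixStructFields(df: DataFrame, parent: String): DataFrame = {
  val fields = df.schema(parent).dataType.asInstanceOf[StructType].fieldNames
  val prefixed: Seq[Column] = fields.map(f => col(s"$parent.$f").as(s"${parent}_$f"))
  val others: Seq[Column] = df.columns.filterNot(_ == parent).toSeq.map(col)
  df.select(others ++ prefixed: _*)
}

Applied once per parsed JSON column, this keeps a key1 coming from two different JSON objects distinguishable, e.g. as col3_key1 and col4_key1.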

Spark Streaming - java.lang.NoSuchMethodError Error

我的梦境 submitted on 2021-01-28 20:00:30
Question: I am trying to access streaming tweets from Spark Streaming. This is the software configuration:

Ubuntu 14.04.2 LTS

scala -version
Scala code runner version 2.11.7 -- Copyright 2002-2013, LAMP/EPFL

spark-submit --version
Spark version 1.6.0

Following is the code:

object PrintTweets {
  def main(args: Array[String]) {
    // Configure Twitter credentials using twitter.txt
    setupTwitter()
    // Set up a Spark streaming context named "PrintTweets" that runs locally using
    // all CPU cores and one
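
A java.lang.NoSuchMethodError in a setup like this usually points to a Scala binary-version mismatch: the pre-built Spark 1.6.0 binaries target Scala 2.10, while the code above was compiled with Scala 2.11.7. A hedged build.sbt sketch that keeps everything on one Scala line; the exact coordinates are assumptions to adapt to the actual build:

// build.sbt -- keep the Scala version aligned with the Spark distribution in use
scalaVersion := "2.10.6"  // the default Spark 1.6.0 download is built against Scala 2.10

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"              % "1.6.0" % "provided",
  "org.apache.spark" %% "spark-streaming"         % "1.6.0" % "provided",
  "org.apache.spark" %% "spark-streaming-twitter" % "1.6.0"
)

Building against Scala 2.11 also works if Spark itself was built for 2.11; the point is that the two must match.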

Spark Streaming + Kafka Integration : Support new topic subscriptions without requiring restart of the streaming context

核能气质少年 submitted on 2021-01-28 05:16:07
Question: I am using a Spark Streaming application (Spark 2.1) to consume data from Kafka (0.10.1) topics. I want to subscribe to new topics without restarting the streaming context. Is there any way to achieve this? I can see a JIRA ticket in the Apache Spark project for the same (https://issues.apache.org/jira/browse/SPARK-10320). Even though it is closed in the 2.0 version, I couldn't find any documentation or example to do this. If any of you are familiar with this, please provide me documentation link or
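
Not mentioned in the question, but with the spark-streaming-kafka-0-10 integration the usual route is ConsumerStrategies.SubscribePattern, which matches topics against a regex that the underlying consumer re-evaluates periodically, so topics created later are picked up without restarting the context. A minimal sketch, assuming an existing StreamingContext ssc and hypothetical broker, group and topic-pattern names:

import java.util.regex.Pattern
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "broker-1:9092",
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "my-streaming-group",
  "auto.offset.reset"  -> "latest"
)

// Any topic whose name matches the pattern is consumed, including topics
// created after the stream has started (subject to metadata refresh).
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.SubscribePattern[String, String](Pattern.compile("events-.*"), kafkaParams)
)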

org.apache.spark.sql.AnalysisException: 'write' can not be called on streaming Dataset/DataFrame

三世轮回 submitted on 2021-01-27 14:23:40
Question: I'm trying to write a Spark Structured Streaming (2.3) dataset to ScyllaDB (Cassandra). My code to write the dataset:

def saveStreamSinkProvider(ds: Dataset[InvoiceItemKafka]) = {
  ds
    .writeStream
    .format("cassandra.ScyllaSinkProvider")
    .outputMode(OutputMode.Append)
    .queryName("KafkaToCassandraStreamSinkProvider")
    .options(
      Map(
        "keyspace" -> namespace,
        "table" -> StreamProviderTableSink,
        "checkpointLocation" -> "/tmp/checkpoints"
      )
    )
    .start()
}

My ScyllaDB streaming sinks:

class
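
The error is raised because .write is a batch API and cannot be invoked on a streaming Dataset anywhere in the sink path. This is not the asker's custom sink, but if upgrading to Spark 2.4+ is an option, foreachBatch hands each micro-batch over as a plain Dataset on which .write is allowed; a sketch reusing ds, namespace and StreamProviderTableSink from the snippet and assuming the spark-cassandra-connector's org.apache.spark.sql.cassandra format:

import org.apache.spark.sql.Dataset

// Each micro-batch arrives as a non-streaming Dataset, so the connector's
// regular batch writer can be used inside the function.
val writeBatch: (Dataset[InvoiceItemKafka], Long) => Unit = (batch, _) =>
  batch.write
    .format("org.apache.spark.sql.cassandra")
    .options(Map("keyspace" -> namespace, "table" -> StreamProviderTableSink))
    .mode("append")
    .save()

val query = ds.writeStream
  .outputMode("append")
  .queryName("KafkaToCassandraForeachBatch")
  .option("checkpointLocation", "/tmp/checkpoints")
  .foreachBatch(writeBatch)
  .start()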

Spark Streaming Exception: java.util.NoSuchElementException: None.get

我怕爱的太早我们不能终老 submitted on 2021-01-27 06:33:10
Question: I am writing Spark Streaming data to HDFS by converting it to a dataframe:

Code

object KafkaSparkHdfs {
  val sparkConf = new SparkConf().setMaster("local[*]").setAppName("SparkKafka")
  sparkConf.set("spark.driver.allowMultipleContexts", "true");
  val sc = new SparkContext(sparkConf)

  def main(args: Array[String]): Unit = {
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._
    val ssc = new StreamingContext(sparkConf, Seconds(20))
    val kafkaParams = Map[String,
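
Not part of the question, but one likely contributor in code like the above is that new StreamingContext(sparkConf, Seconds(20)) builds a second SparkContext alongside the explicitly created sc, which spark.driver.allowMultipleContexts merely papers over. A minimal sketch of a single-context setup, keeping the names from the snippet:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sparkConf = new SparkConf().setMaster("local[*]").setAppName("SparkKafka")
val sc = new SparkContext(sparkConf)

// Build the StreamingContext from the existing SparkContext rather than from
// the SparkConf, so only one context exists and allowMultipleContexts is unnecessary.
val ssc = new StreamingContext(sc, Seconds(20))
val sqlContext = new org.apache.spark.sql.SQLContext(sc)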

How to safely restart Airflow and kill a long-running task?

一个人想着一个人 submitted on 2021-01-07 06:21:49
Question: I have Airflow running in Kubernetes using the CeleryExecutor. Airflow submits and monitors Spark jobs using the DatabricksOperator. My streaming Spark jobs have a very long runtime (they run forever unless they fail or are cancelled). When Airflow worker pods are killed while a streaming job is running, the following happens:

- The associated task becomes a zombie (running state, but no process with a heartbeat)
- The task is marked as failed when Airflow reaps zombies
- The Spark streaming job continues
