spark-streaming

Spark Scala UDP receive on listening port

不羁的心 submitted on 2019-12-10 10:44:47
Question: The example mentioned in http://spark.apache.org/docs/latest/streaming-programming-guide.html lets me receive data packets over a TCP stream, listening on port 9999. import org.apache.spark._ import org.apache.spark.streaming._ import org.apache.spark.streaming.StreamingContext._ // not necessary since Spark 1.3 // Create a local StreamingContext with two working threads and a batch interval of 1 second. // The master requires 2 cores to prevent a starvation scenario. val conf = new
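
Spark does not ship a UDP equivalent of socketTextStream, so a UDP source is usually written as a custom receiver. Below is a minimal sketch of that approach; UdpReceiver is a hypothetical class name, and ssc is assumed to be the StreamingContext set up as in the docs example.

    import java.net.{DatagramPacket, DatagramSocket}
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    class UdpReceiver(port: Int) extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {
      @volatile private var socket: DatagramSocket = _

      def onStart(): Unit = {
        socket = new DatagramSocket(port)
        new Thread("udp-receiver") { override def run(): Unit = receive() }.start()
      }

      def onStop(): Unit = if (socket != null) socket.close()

      private def receive(): Unit = {
        val buffer = new Array[Byte](65536)
        try {
          while (!isStopped()) {
            val packet = new DatagramPacket(buffer, buffer.length)
            socket.receive(packet)                                    // blocks until a datagram arrives
            store(new String(packet.getData, 0, packet.getLength))    // hand the payload to Spark
          }
        } catch {
          case e: Exception => restart("Error receiving UDP data", e)
        }
      }
    }

    // Usage, mirroring the TCP example on port 9999:
    // val lines = ssc.receiverStream(new UdpReceiver(9999))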

Spark Streaming: long queued/active batches

冷暖自知 submitted on 2019-12-10 07:33:41
Question: Could anyone please point out the cause of these active batches hanging there for many weeks and never being processed? Thanks a lot. My guess is that there are not enough executors, and that more workers/executors would solve the problem? Or does Spark assign priorities to different batches within its task scheduler? The situation here is that very recent batches (end of June) were processed successfully, but batches from May are still queued. I just checked my Spark settings; the scheduler policy is FIFO spark
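
For reference, a sketch of the settings usually inspected in this situation (the values are illustrative assumptions, not a confirmed fix): the scheduler mode the question mentions, plus the Spark Streaming backpressure and rate-limit knobs that keep batches from piling up when processing is slower than the batch interval.

    // Illustrative SparkConf settings only; tune to the actual workload.
    val conf = new org.apache.spark.SparkConf()
      .set("spark.scheduler.mode", "FAIR")                       // default is FIFO, as in the question
      .set("spark.streaming.backpressure.enabled", "true")       // let Spark throttle the ingest rate
      .set("spark.streaming.kafka.maxRatePerPartition", "1000")  // hard cap per partition, if Kafka is the source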

How to read JSON data using Scala from a Kafka topic in Apache Spark

人走茶凉 submitted on 2019-12-10 03:56:14
Question: I am new to Spark. Could you please let me know how to read JSON data from a Kafka topic using Scala in Apache Spark? Thanks. Answer 1: The simplest method would be to make use of the DataFrame abstraction shipped with Spark. val sqlContext = new SQLContext(sc) val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder]( ssc, kafkaParams, Set("myTopicName")) stream.foreachRDD( rdd => { val dataFrame = sqlContext.read.json(rdd.map(_._2)) //converts json to DF //do your
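
Filling in the surrounding boilerplate, a self-contained sketch of the same approach; the broker address, topic name, and batch interval are assumed values, and the API shown is the spark-streaming-kafka-0-8 direct stream used in the answer.

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    object KafkaJsonExample {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("kafka-json").setMaster("local[2]")
        val ssc = new StreamingContext(conf, Seconds(10))
        val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")   // assumed broker address

        val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
          ssc, kafkaParams, Set("myTopicName"))

        stream.foreachRDD { rdd =>
          if (!rdd.isEmpty()) {
            val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
            val dataFrame = sqlContext.read.json(rdd.map(_._2))   // each record is (key, jsonString)
            dataFrame.show()                                       // stand-in for the real processing
          }
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }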

How to create Spark RDD from an iterator?

折月煮酒 submitted on 2019-12-10 00:56:47
Question: To make it clear, I am not looking for an RDD from an array/list like List<Integer> list = Arrays.asList(1, 2, 3, 4, 5, 6, 7); // sample JavaRDD<Integer> rdd = new JavaSparkContext().parallelize(list); How can I create a Spark RDD from a Java iterator without completely buffering it in memory? Iterator<Integer> iterator = Arrays.asList(1, 2, 3, 4).iterator(); //sample iterator for illustration JavaRDD<Integer> rdd = new JavaSparkContext().what("?", iterator); //the Question Additional Question:
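
For context, a minimal Scala sketch of the baseline the question is trying to avoid: parallelize only accepts a materialized collection, so a plain iterator has to be buffered on the driver before it can become an RDD this way. Avoiding that buffering generally means staging the data somewhere Spark can read directly (e.g. a file) or writing a custom RDD.

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("iterator-to-rdd").setMaster("local[*]"))
    val iterator = Iterator(1, 2, 3, 4)            // sample iterator for illustration
    val rdd = sc.parallelize(iterator.toSeq)       // toSeq buffers the entire iterator in driver memory
    println(rdd.count())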

Spark Streaming Kafka offset management

不羁岁月 submitted on 2019-12-10 00:03:08
Question: I have been running Spark Streaming jobs that consume and produce data through Kafka. I used a direct DStream, so I had to manage offsets myself; we adopted Redis to write and read offsets. Now there is one problem: when I launch my client, it needs to get the offsets from Redis, not the offsets that exist in Kafka itself. How should I write my code? I have written my code below: kafka_stream = KafkaUtils.createDirectStream( ssc, topics=[config.CONSUME_TOPIC, ], kafkaParams={"bootstrap.servers":
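
The question's snippet is PySpark, but for illustration here is a minimal sketch of the same pattern in the Scala spark-streaming-kafka-0-8 API: the createDirectStream overload that takes a fromOffsets map starts from externally stored offsets instead of Kafka's own. readOffsetsFromRedis() is a hypothetical helper standing in for whatever Redis lookup the job performs, and ssc/kafkaParams are assumed to already exist.

    import kafka.common.TopicAndPartition
    import kafka.message.MessageAndMetadata
    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    // Offsets previously written to Redis, e.g. Map(TopicAndPartition("myTopic", 0) -> 42L)
    val fromOffsets: Map[TopicAndPartition, Long] = readOffsetsFromRedis()

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](
      ssc,
      kafkaParams,
      fromOffsets,
      (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))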

Spark Streaming connection pool in each JVM

时间秒杀一切 submitted on 2019-12-09 23:49:33
Question: In my Spark Streaming app, I have many I/O operations, such as Codis, HBase, etc. I want to make sure there is exactly one connection pool in each executor; how can I do this elegantly? Right now I implement several static classes scattered around, and this is not good for management. How about centralizing them into one class like xxContext, somewhat like SparkContext, and do I need to broadcast it? I know it's good to broadcast a large read-only dataset, but what about these connection pools? Java or Scala are both acceptable.
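
As a sketch of the usual pattern (not a confirmed answer to this particular question), a Scala object with a lazy val gives one instance per executor JVM without any broadcasting; ConnectionPool, createPool, borrow, release, and write below are hypothetical placeholders for whatever client library is actually used.

    object ExecutorPools {
      // A lazy val on an object is initialized once per JVM, i.e. once per executor,
      // the first time any task running on that executor touches it.
      lazy val pool: ConnectionPool = createPool()
    }

    dstream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        val conn = ExecutorPools.pool.borrow()   // executes on the executor, reusing the JVM-wide pool
        try records.foreach(record => conn.write(record))
        finally ExecutorPools.pool.release(conn)
      }
    }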

Unable to serialize SparkContext in foreachRDD

吃可爱长大的小学妹 submitted on 2019-12-09 23:00:53
Question: I am trying to save streaming data from Kafka to Cassandra. I am able to read and parse the data, but when I call the lines below to save it, I get a Task not serializable exception. My class extends Serializable, but I am not sure why I am seeing this error and didn't get much help even after googling for 3 hours. Can somebody give me any pointers? val collection = sc.parallelize(Seq((obj.id, obj.data))) collection.saveToCassandra("testKS", "testTable", SomeColumns("id", "data"))
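
A minimal sketch of the fix most often suggested for this symptom, assuming the DataStax spark-cassandra-connector and that the save happens inside foreachRDD: SparkContext itself can never be serialized, so instead of referencing a context field of the enclosing class from the closure, derive it from the RDD that Spark hands over. The parse() step is a hypothetical placeholder for the question's parsing logic.

    import com.datastax.spark.connector._   // brings saveToCassandra into scope

    dstream.foreachRDD { rdd =>
      val sc = rdd.sparkContext              // obtained per batch on the driver; nothing captured from the outer class
      val obj = parse(rdd)                   // hypothetical parsing step producing an object with id/data
      val collection = sc.parallelize(Seq((obj.id, obj.data)))
      collection.saveToCassandra("testKS", "testTable", SomeColumns("id", "data"))
    }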

Exporting Spark worker/executor metrics to Prometheus using the JMX agent

谁说胖子不能爱 submitted on 2019-12-09 18:25:53
Question: I have followed the instructions here to enable metrics export to Prometheus for Spark. In order to enable metrics export not just from the job but also from the master and workers, I have enabled the JMX agent for all of the Spark driver, master, worker, and executor. This causes a problem since the Spark worker and executor are collocated on the same machine and, thus, I need to pass different JMX ports to them. This is not a problem if I have a 1-1 relationship between spark workers and
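
For orientation, a sketch of how the agent is typically wired per role; the jar path, config path, and port numbers below are illustrative assumptions. The collocation problem in the question is that per-role fixed ports end up sharing a host.

    # spark-defaults.conf -- one agent per role, each with its own port
    spark.driver.extraJavaOptions    -javaagent:/opt/jmx_prometheus_javaagent.jar=8090:/opt/prometheus.yaml
    spark.executor.extraJavaOptions  -javaagent:/opt/jmx_prometheus_javaagent.jar=8093:/opt/prometheus.yaml

    # spark-env.sh -- the master and worker daemons get their own ports
    SPARK_MASTER_OPTS="-javaagent:/opt/jmx_prometheus_javaagent.jar=8091:/opt/prometheus.yaml"
    SPARK_WORKER_OPTS="-javaagent:/opt/jmx_prometheus_javaagent.jar=8092:/opt/prometheus.yaml"

    # Note: with a worker and several executors on one machine, a single fixed executor port (8093)
    # still collides as soon as more than one executor starts -- which is the crux of the question.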

How to refresh a table and do it concurrently?

拜拜、爱过 submitted on 2019-12-09 17:33:00
Question: I'm using Spark Streaming 2.1. I'd like to periodically refresh some cached tables (loaded via a Spark-provided DataSource like Parquet or MySQL, or a user-defined data source). How do I refresh a table? Suppose I have a table loaded by spark.read.format("").load().createTempView("my_table"), and it is also cached by spark.sql("cache table my_table"). Is the following code enough to refresh the table, and when the table is loaded next, will it automatically be cached? spark.sql("refresh table my
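
As a sketch of one commonly suggested alternative (not necessarily what REFRESH TABLE does by itself), the view can be rebuilt and re-registered for caching on a schedule; the source format, path, and interval below are assumptions, and spark is the SparkSession.

    import java.util.concurrent.{Executors, TimeUnit}

    val refresher = Executors.newSingleThreadScheduledExecutor()
    refresher.scheduleAtFixedRate(new Runnable {
      def run(): Unit = {
        spark.catalog.uncacheTable("my_table")                  // drop the stale cached data
        spark.read.format("parquet").load("/data/my_table")     // hypothetical source
          .createOrReplaceTempView("my_table")
        spark.catalog.cacheTable("my_table")                    // marked for caching; materialized on next access
      }
    }, 0, 10, TimeUnit.MINUTES)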

Spark structured streaming consistency across sinks

▼魔方 西西 submitted on 2019-12-09 05:50:58
Question: I'd like to better understand the consistency model of Spark 2.2 Structured Streaming in the following case: one source (Kinesis), and 2 queries from this source towards 2 different sinks: one file sink for archival purposes (S3), and another sink for processed data (a DB or file, not yet decided). I'd like to understand whether there is any consistency guarantee across sinks, at least under certain circumstances: Can one of the sinks be way ahead of the other? Or are they consuming data at the same speed
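
To make the setup concrete, a minimal sketch of the two-sink layout (the paths, the Kinesis connector, and the transform() step are assumptions): each start() creates an independent streaming query with its own checkpoint and its own record of how far it has read, so Structured Streaming itself does not force the two sinks to advance through the source in lockstep.

    val source = spark.readStream.format("kinesis").load()      // assumes a Kinesis source connector on the classpath

    val archiveQuery = source.writeStream
      .format("parquet")
      .option("path", "s3://bucket/archive/")
      .option("checkpointLocation", "s3://bucket/checkpoints/archive/")
      .start()

    val processedQuery = transform(source).writeStream          // transform() is a hypothetical processing step
      .format("parquet")
      .option("path", "s3://bucket/processed/")
      .option("checkpointLocation", "s3://bucket/checkpoints/processed/")
      .start()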