spark-streaming

Using Java 8 parallelStream inside Spark mapPartitions

﹥>﹥吖頭↗ Submitted on 2019-12-10 16:51:45
Question: I am trying to understand the behavior of a Java 8 parallel stream inside Spark's parallelism. When I run the code below, I expect the output size of listOfThings to be the same as the input size, but that's not the case: I sometimes have missing items in my output. This behavior is not consistent. If I just iterate through the iterator instead of using parallelStream, everything is fine and the count matches every time. // listRDD.count = 10 JavaRDD test = listRDD.mapPartitions(iterator -> { List
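
The excerpt above is cut off, so here is a minimal, hedged sketch of the pattern in question, with listRDD assumed to be a JavaRDD of String and toUpperCase standing in for the real per-element work. One common cause of this symptom, though not necessarily the one here, is accumulating parallel-stream results by mutating a plain ArrayList, which is not thread-safe and can silently drop elements; letting the stream build the result with a collector avoids that.

import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import org.apache.spark.api.java.JavaRDD;

// Sketch only: listRDD is assumed to be a JavaRDD<String>.
// (Spark 2.x signature shown: the mapPartitions function returns an Iterator;
// on Spark 1.x it would return an Iterable.)
JavaRDD<String> test = listRDD.mapPartitions(iterator -> {
    // Materialize the partition so it can be streamed.
    List<String> input = new ArrayList<>();
    iterator.forEachRemaining(input::add);

    // Risky pattern (can silently lose items): mutating a shared ArrayList
    // from a parallel stream is not thread-safe.
    //   List<String> unsafe = new ArrayList<>();
    //   input.parallelStream().forEach(x -> unsafe.add(x.toUpperCase()));

    // Safer pattern: let the parallel stream assemble the result itself.
    List<String> output = input.parallelStream()
            .map(x -> x.toUpperCase())   // stand-in for the real work
            .collect(Collectors.toList());
    return output.iterator();
});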

Can I use log4j2.xml in my Apache Spark application

牧云@^-^@ Submitted on 2019-12-10 16:44:11
Question: We are trying to use log4j2.xml instead of log4j.properties in an Apache Spark application. We have integrated log4j2.xml, but the problem is that the application's worker (executor) logs are not written, while the driver log is written without any problem. Can anyone suggest how to integrate log4j2.xml in an Apache Spark application so that both the worker and driver logs are written? Thanks in advance. Source: https://stackoverflow.com/questions/37966044/can-i-use-log4j2-xml-in-my-apache-spark-application
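
No answer is included in this excerpt. One commonly suggested approach, sketched below under assumptions (file locations are placeholders, not taken from the question), is to ship log4j2.xml with the job and point both JVMs at it through the standard extraJavaOptions settings, since executors cannot see a file that only exists on the driver machine.

import org.apache.spark.SparkConf;

// Hedged sketch; paths are placeholders.
SparkConf conf = new SparkConf()
        .setAppName("log4j2-example")
        // Distribute log4j2.xml into every executor's working directory.
        .set("spark.files", "/local/path/log4j2.xml")
        // The driver JVM reads its local copy.
        .set("spark.driver.extraJavaOptions",
                "-Dlog4j.configurationFile=/local/path/log4j2.xml")
        // Executor JVMs read the copy shipped via spark.files.
        .set("spark.executor.extraJavaOptions",
                "-Dlog4j.configurationFile=log4j2.xml");

The same settings can be passed to spark-submit with --files and --conf; note that in client mode the driver option generally has to be given on the command line, because the driver JVM is already running by the time SparkConf is evaluated.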

How to evaluate Spark DStream objects with a Spark data frame

百般思念 Submitted on 2019-12-10 14:34:55
Question: I am writing a Spark app where I need to evaluate the streaming data against historical data that sits in a SQL Server database. The idea is that Spark will fetch the historical data from the database, persist it in memory, and evaluate the streaming data against it. I am currently receiving the streaming data as follows: import re from pyspark import SparkContext from pyspark.streaming import StreamingContext from pyspark.sql import SQLContext,functions as func,Row sc = SparkContext(
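
The excerpt is cut off before the evaluation step. A minimal sketch of the described approach follows; it is written in Java to keep one language across the examples on this page (the question itself uses PySpark), and the JDBC URL, table, and column positions are placeholders. The historical data is loaded once, cached, and each micro-batch is joined against it.

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import scala.Tuple2;

// jsc is an existing JavaSparkContext; Spark 1.x DataFrame API to match the question's imports.
SQLContext sqlContext = new SQLContext(jsc);

// Load the historical data from the SQL database once, on the driver.
DataFrame history = sqlContext.read()
        .format("jdbc")
        .option("url", "jdbc:sqlserver://dbhost;databaseName=mydb")   // placeholder
        .option("dbtable", "dbo.history")                             // placeholder
        .load();

// Key it and cache it so every batch reuses the same in-memory copy.
JavaPairRDD<String, Double> historyByKey = history.javaRDD()
        .mapToPair(row -> new Tuple2<>(row.getString(0), row.getDouble(1)))
        .cache();

// keyedStream is an assumed JavaPairDStream<String, Double> built from the streaming input.
JavaPairDStream<String, Tuple2<Double, Double>> evaluated =
        keyedStream.transformToPair(rdd -> rdd.join(historyByKey));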

Does caching in Spark Streaming increase performance

坚强是说给别人听的谎言 Submitted on 2019-12-10 14:28:32
Question: I'm performing multiple operations on the same RDD in a Kafka stream. Is caching that RDD going to improve performance? Answer 1: When running multiple operations on the same dstream, cache will substantially improve performance. This can be observed on the Spark UI: without the use of cache, each iteration on the dstream will take the same time, so the total time to process the data in each batch interval will be linear in the number of iterations on the data. When cache is used, the first
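
A minimal sketch of what the answer describes (names are placeholders): cache the DStream before running several independent actions on it, so each micro-batch is computed once rather than once per action.

import org.apache.spark.streaming.api.java.JavaDStream;

// lines is an assumed JavaDStream<String> read from Kafka.
JavaDStream<String> parsed = lines.map(line -> line.trim());   // stand-in transformation
parsed.cache();   // each batch RDD is materialized once and reused by the actions below

parsed.foreachRDD(rdd -> System.out.println("records:  " + rdd.count()));
parsed.foreachRDD(rdd -> System.out.println("distinct: " + rdd.distinct().count()));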

Combining Spark Streaming + MLlib

家住魔仙堡 Submitted on 2019-12-10 13:56:38
Question: I've tried to use a Random Forest model to predict a stream of examples, but it appears that I cannot use that model to classify the examples. Here is the code used in pyspark: sc = SparkContext(appName="App") model = RandomForest.trainClassifier(trainingData, numClasses=2, categoricalFeaturesInfo={}, impurity='gini', numTrees=150) ssc = StreamingContext(sc, 1) lines = ssc.socketTextStream(hostname, int(port)) parsedLines = lines.map(parse) parsedLines.pprint() predictions =
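
The excerpt stops before the prediction step. A common workaround in this situation is to score each micro-batch as a whole from the driver, via transform, instead of referencing the model inside per-record functions (which is where PySpark MLlib models typically fail). Sketched below in Java to keep one language across these examples; featureStream is an assumed stream of already-parsed feature vectors.

import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.tree.model.RandomForestModel;
import org.apache.spark.streaming.api.java.JavaDStream;

// model is a trained RandomForestModel; featureStream is an assumed JavaDStream<Vector>.
// transform runs on the driver for each batch, so the model predicts on the whole RDD at once.
JavaDStream<Double> predictions =
        featureStream.transform(rdd -> model.predict(rdd));
predictions.print();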

Spark checkpointing error when joining static dataset with DStream

馋奶兔 Submitted on 2019-12-10 13:22:06
Question: I am writing a Spark Streaming application in Java. My Spark application reads a continuous feed from a Hadoop directory using textFileStream() at an interval of 1 minute. I need to perform a Spark aggregation (group by) operation on the incoming DStream. After aggregation, I join the aggregated DStream<Key, Value1> with an RDD<Key, Value2> created from a static dataset read by textFile() from a Hadoop directory. The problem comes when I enable checkpointing. With an empty checkpoint
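
The excerpt is cut off, but the join step it describes usually looks like the sketch below (types and the path are placeholders, not the original code). One commonly cited pitfall with this combination: a static RDD referenced inside the transform closure is not saved as part of the streaming checkpoint, so after a restart the recovered closure can point at an RDD from the old context, which tends to surface exactly as checkpoint-related errors.

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import scala.Tuple2;

// jssc is an existing JavaStreamingContext; aggregatedStream is the
// JavaPairDStream<String, Long> produced by the group-by step (types assumed).
JavaPairRDD<String, String> staticData = jssc.sparkContext()
        .textFile("hdfs:///data/reference/")          // placeholder path
        .mapToPair(line -> {
            String[] parts = line.split(",");
            return new Tuple2<>(parts[0], parts[1]);
        })
        .cache();

JavaPairDStream<String, Tuple2<Long, String>> joined =
        aggregatedStream.transformToPair(rdd -> rdd.join(staticData));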

Kafka Spark Stream throws Exception: No current assignment for partition

我们两清 Submitted on 2019-12-10 12:12:50
Question: Below is my Scala code to create a Spark Kafka stream: val kafkaParams = Map[String, Object]( "bootstrap.servers" -> "server110:2181,server110:9092", "zookeeper" -> "server110:2181", "key.deserializer" -> classOf[StringDeserializer], "value.deserializer" -> classOf[StringDeserializer], "group.id" -> "example", "auto.offset.reset" -> "latest", "enable.auto.commit" -> (false: java.lang.Boolean) ) val topics = Array("ABTest") val stream = KafkaUtils.createDirectStream[String, String]( ssc,
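
The call is truncated above; for reference, the complete shape of a spark-streaming-kafka-0-10 direct stream is sketched below (rendered in Java to keep one language across these examples; values are taken from the question or are placeholders). Two things worth noting, without claiming either is the cause of the exception: the new consumer bootstraps from Kafka brokers, so the ZooKeeper address and port 2181 do not belong in bootstrap.servers, and a "zookeeper" entry is not a consumer property in this API.

import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

// jssc is an existing JavaStreamingContext.
Map<String, Object> kafkaParams = new HashMap<>();
kafkaParams.put("bootstrap.servers", "server110:9092");   // brokers only
kafkaParams.put("key.deserializer", StringDeserializer.class);
kafkaParams.put("value.deserializer", StringDeserializer.class);
kafkaParams.put("group.id", "example");
kafkaParams.put("auto.offset.reset", "latest");
kafkaParams.put("enable.auto.commit", false);

Collection<String> topics = Arrays.asList("ABTest");

JavaInputDStream<ConsumerRecord<String, String>> stream =
        KafkaUtils.createDirectStream(
                jssc,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.<String, String>Subscribe(topics, kafkaParams));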

Merging micro batches in Spark Streaming

佐手、 Submitted on 2019-12-10 11:00:03
Question: (I have a little knowledge of batch Spark, but none of Spark Streaming.) Problem: I have a Kafka topic Kafka[(A,B)->X] where (A,B) is the key (A and B are simple numeric types) and X is the message type, relatively big (a couple of MB). Putting aside the problem of input failures, the data is a grid: for each a in A, there will be messages (a,b) for all b in B. Moreover, the b's are ordered, and I think we can assume that all messages for one a will arrive following the b's order (what

Spark streaming tab disappears after restarting from checkpoint

喜欢而已 Submitted on 2019-12-10 10:59:00
Question: I have a Spark Streaming job running on a cluster (Spark 1.6) which checkpoints to S3. When I start the job initially, I can see the "Streaming" tab. However, when I restart the job from the checkpoint, the Streaming tab disappears. The job still works as a streaming job, and I see the batches appear at the configured batch interval. See below. If I clear out the checkpoint data, the tab comes back. I suspect that the Streaming tab is not registered correctly when restarting from a checkpoint. I

Spark Mesos cluster mode is slower than local mode

℡╲_俬逩灬. Submitted on 2019-12-10 10:56:32
问题 I submit the same jar to run by using both local mode and mesos cluster mode. And found for some exactly same stages, local mode only takes several milliseconds to finish however cluster mode will take seconds! listed is one example: stage 659 local mode: 659 Streaming job from [output operation 1, batch time 17:45:50] map at KafkaHelper.scala:35 +details 2016/03/22 17:46:31 11 ms mesos cluster mode: 659 Streaming job from [output operation 1, batch time 18:01:20] map at KafkaHelper.scala:35