spark-streaming

Using Java 8 parallelStream inside Spark mapPartitions

﹥>﹥吖頭↗ Submitted on 2019-12-10 16:51:45
Question: I am trying to understand the behavior of a Java 8 parallel stream inside Spark's parallelism. When I run the code below, I expect the output size of listOfThings to be the same as the input size, but that's not the case: I sometimes have missing items in my output. This behavior is not consistent. If I just iterate through the iterator instead of using parallelStream, everything is fine and the count matches every time. // listRDD.count = 10 JavaRDD test = listRDD.mapPartitions(iterator -> { List
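
The excerpt above is cut off, so here is a minimal, hedged sketch of the pattern in question, with listRDD assumed to be a JavaRDD of String and toUpperCase standing in for the real per-element work. One common cause of this symptom, though not necessarily the one here, is accumulating parallel-stream results by mutating a plain ArrayList, which is not thread-safe and can silently drop elements; letting the stream build the result with a collector avoids that.

import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import org.apache.spark.api.java.JavaRDD;

// Sketch only: listRDD is assumed to be a JavaRDD<String>.
// (Spark 2.x signature shown: the mapPartitions function returns an Iterator;
// on Spark 1.x it would return an Iterable.)
JavaRDD<String> test = listRDD.mapPartitions(iterator -> {
    // Materialize the partition so it can be streamed.
    List<String> input = new ArrayList<>();
    iterator.forEachRemaining(input::add);

    // Risky pattern (can silently lose items): mutating a shared ArrayList
    // from a parallel stream is not thread-safe.
    //   List<String> unsafe = new ArrayList<>();
    //   input.parallelStream().forEach(x -> unsafe.add(x.toUpperCase()));

    // Safer pattern: let the parallel stream assemble the result itself.
    List<String> output = input.parallelStream()
            .map(x -> x.toUpperCase())   // stand-in for the real work
            .collect(Collectors.toList());
    return output.iterator();
});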

Can I use log4j2.xml in my Apache Spark application

牧云@^-^@ Submitted on 2019-12-10 16:44:11
Question: We are trying to use log4j2.xml instead of log4j.properties in an Apache Spark application. We have integrated log4j2.xml, but the problem is that the application's worker (executor) logs are not written, while the driver log is written without any problem. Can anyone suggest how to integrate log4j2.xml in an Apache Spark application so that both the worker and driver logs are written? Thanks in advance. Source: https://stackoverflow.com/questions/37966044/can-i-use-log4j2-xml-in-my-apache-spark-application
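
No answer is included in this excerpt. One commonly suggested approach, sketched below under assumptions (file locations are placeholders, not taken from the question), is to ship log4j2.xml with the job and point both JVMs at it through the standard extraJavaOptions settings, since executors cannot see a file that only exists on the driver machine.

import org.apache.spark.SparkConf;

// Hedged sketch; paths are placeholders.
SparkConf conf = new SparkConf()
        .setAppName("log4j2-example")
        // Distribute log4j2.xml into every executor's working directory.
        .set("spark.files", "/local/path/log4j2.xml")
        // The driver JVM reads its local copy.
        .set("spark.driver.extraJavaOptions",
                "-Dlog4j.configurationFile=/local/path/log4j2.xml")
        // Executor JVMs read the copy shipped via spark.files.
        .set("spark.executor.extraJavaOptions",
                "-Dlog4j.configurationFile=log4j2.xml");

The same settings can be passed to spark-submit with --files and --conf; note that in client mode the driver option generally has to be given on the command line, because the driver JVM is already running by the time SparkConf is evaluated.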

How to evaluate Spark DStream objects with a Spark data frame

百般思念 Submitted on 2019-12-10 14:34:55
Question: I am writing a Spark app where I need to evaluate the streaming data against historical data that sits in a SQL Server database. The idea is that Spark will fetch the historical data from the database, persist it in memory, and evaluate the streaming data against it. I am currently receiving the streaming data as follows: import re from pyspark import SparkContext from pyspark.streaming import StreamingContext from pyspark.sql import SQLContext,functions as func,Row sc = SparkContext(
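
The excerpt is cut off before the evaluation step. A minimal sketch of the described approach follows; it is written in Java to keep one language across the examples on this page (the question itself uses PySpark), and the JDBC URL, table, and column positions are placeholders. The historical data is loaded once, cached, and each micro-batch is joined against it.

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import scala.Tuple2;

// jsc is an existing JavaSparkContext; Spark 1.x DataFrame API to match the question's imports.
SQLContext sqlContext = new SQLContext(jsc);

// Load the historical data from the SQL database once, on the driver.
DataFrame history = sqlContext.read()
        .format("jdbc")
        .option("url", "jdbc:sqlserver://dbhost;databaseName=mydb")   // placeholder
        .option("dbtable", "dbo.history")                             // placeholder
        .load();

// Key it and cache it so every batch reuses the same in-memory copy.
JavaPairRDD<String, Double> historyByKey = history.javaRDD()
        .mapToPair(row -> new Tuple2<>(row.getString(0), row.getDouble(1)))
        .cache();

// keyedStream is an assumed JavaPairDStream<String, Double> built from the streaming input.
JavaPairDStream<String, Tuple2<Double, Double>> evaluated =
        keyedStream.transformToPair(rdd -> rdd.join(historyByKey));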

Does caching in Spark Streaming increase performance

坚强是说给别人听的谎言 Submitted on 2019-12-10 14:28:32
Question: I'm performing multiple operations on the same RDD in a Kafka stream. Is caching that RDD going to improve performance? Answer 1: When running multiple operations on the same dstream, cache will substantially improve performance. This can be observed on the Spark UI: without the use of cache, each iteration on the dstream will take the same time, so the total time to process the data in each batch interval will be linear in the number of iterations on the data. When cache is used, the first
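
A minimal sketch of what the answer describes (names are placeholders): cache the DStream before running several independent actions on it, so each micro-batch is computed once rather than once per action.

import org.apache.spark.streaming.api.java.JavaDStream;

// lines is an assumed JavaDStream<String> read from Kafka.
JavaDStream<String> parsed = lines.map(line -> line.trim());   // stand-in transformation
parsed.cache();   // each batch RDD is materialized once and reused by the actions below

parsed.foreachRDD(rdd -> System.out.println("records:  " + rdd.count()));
parsed.foreachRDD(rdd -> System.out.println("distinct: " + rdd.distinct().count()));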

Combining Spark Streaming + MLlib

家住魔仙堡 Submitted on 2019-12-10 13:56:38
Question: I've tried to use a Random Forest model to predict a stream of examples, but it appears that I cannot use that model to classify the examples. Here is the code used in pyspark: sc = SparkContext(appName="App") model = RandomForest.trainClassifier(trainingData, numClasses=2, categoricalFeaturesInfo={}, impurity='gini', numTrees=150) ssc = StreamingContext(sc, 1) lines = ssc.socketTextStream(hostname, int(port)) parsedLines = lines.map(parse) parsedLines.pprint() predictions =
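
The excerpt stops before the prediction step. A common workaround in this situation is to score each micro-batch as a whole from the driver, via transform, instead of referencing the model inside per-record functions (which is where PySpark MLlib models typically fail). Sketched below in Java to keep one language across these examples; featureStream is an assumed stream of already-parsed feature vectors.

import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.tree.model.RandomForestModel;
import org.apache.spark.streaming.api.java.JavaDStream;

// model is a trained RandomForestModel; featureStream is an assumed JavaDStream<Vector>.
// transform runs on the driver for each batch, so the model predicts on the whole RDD at once.
JavaDStream<Double> predictions =
        featureStream.transform(rdd -> model.predict(rdd));
predictions.print();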

Spark checkpointing error when joining static dataset with DStream

馋奶兔 Submitted on 2019-12-10 13:22:06
Question: I am writing a Spark Streaming application in Java. My Spark application reads a continuous feed from a Hadoop directory using textFileStream() at an interval of 1 minute. I need to perform a Spark aggregation (group by) operation on the incoming DStream. After aggregation, I join the aggregated DStream<Key, Value1> with an RDD<Key, Value2> created from a static dataset read by textFile() from a Hadoop directory. The problem comes when I enable checkpointing. With an empty checkpoint
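
The excerpt is cut off, but the join step it describes usually looks like the sketch below (types and the path are placeholders, not the original code). One commonly cited pitfall with this combination: a static RDD referenced inside the transform closure is not saved as part of the streaming checkpoint, so after a restart the recovered closure can point at an RDD from the old context, which tends to surface exactly as checkpoint-related errors.

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import scala.Tuple2;

// jssc is an existing JavaStreamingContext; aggregatedStream is the
// JavaPairDStream<String, Long> produced by the group-by step (types assumed).
JavaPairRDD<String, String> staticData = jssc.sparkContext()
        .textFile("hdfs:///data/reference/")          // placeholder path
        .mapToPair(line -> {
            String[] parts = line.split(",");
            return new Tuple2<>(parts[0], parts[1]);
        })
        .cache();

JavaPairDStream<String, Tuple2<Long, String>> joined =
        aggregatedStream.transformToPair(rdd -> rdd.join(staticData));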

Kafka Spark Stream throws Exception: No current assignment for partition

我们两清 Submitted on 2019-12-10 12:12:50
Question: Below is my Scala code to create a Spark Kafka stream: val kafkaParams = Map[String, Object]( "bootstrap.servers" -> "server110:2181,server110:9092", "zookeeper" -> "server110:2181", "key.deserializer" -> classOf[StringDeserializer], "value.deserializer" -> classOf[StringDeserializer], "group.id" -> "example", "auto.offset.reset" -> "latest", "enable.auto.commit" -> (false: java.lang.Boolean) ) val topics = Array("ABTest") val stream = KafkaUtils.createDirectStream[String, String]( ssc,
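
The call is truncated above; for reference, the complete shape of a spark-streaming-kafka-0-10 direct stream is sketched below (rendered in Java to keep one language across these examples; values are taken from the question or are placeholders). Two things worth noting, without claiming either is the cause of the exception: the new consumer bootstraps from Kafka brokers, so the ZooKeeper address and port 2181 do not belong in bootstrap.servers, and a "zookeeper" entry is not a consumer property in this API.

import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

// jssc is an existing JavaStreamingContext.
Map<String, Object> kafkaParams = new HashMap<>();
kafkaParams.put("bootstrap.servers", "server110:9092");   // brokers only
kafkaParams.put("key.deserializer", StringDeserializer.class);
kafkaParams.put("value.deserializer", StringDeserializer.class);
kafkaParams.put("group.id", "example");
kafkaParams.put("auto.offset.reset", "latest");
kafkaParams.put("enable.auto.commit", false);

Collection<String> topics = Arrays.asList("ABTest");

JavaInputDStream<ConsumerRecord<String, String>> stream =
        KafkaUtils.createDirectStream(
                jssc,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.<String, String>Subscribe(topics, kafkaParams));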

Merging micro batches in Spark Streaming

佐手、 Submitted on 2019-12-10 11:00:03
Question: (I have a little knowledge of batch Spark, but none of Spark Streaming.) Problem: I have a Kafka topic Kafka[(A,B)->X] where (A,B) is the key (A and B are simple numeric types) and X is the message type, relatively big (a couple of MB). Putting aside the problem of input failures, the data is a grid: for each a in A, there will be messages (a,b) for all b in B. Moreover, the b's are ordered, and I think we can assume that all messages for one a will arrive following the b's order (what

Spark streaming tab disappears after restarting from checkpoint

喜欢而已 Submitted on 2019-12-10 10:59:00
Question: I have a Spark Streaming job running on a cluster (Spark 1.6) which checkpoints to S3. When I start the job initially, I can see the "Streaming" tab. However, when I restart the job from the checkpoint, the Streaming tab disappears. The job still works as a streaming job, and I see the batches appear at the configured batch interval. See below. If I clear out the checkpoint data, the tab comes back. I suspect that the Streaming tab is not registered correctly when restarting from a checkpoint. I

Spark Mesos cluster mode is slower than local mode

℡╲_俬逩灬. Submitted on 2019-12-10 10:56:32
问题 I submit the same jar to run by using both local mode and mesos cluster mode. And found for some exactly same stages, local mode only takes several milliseconds to finish however cluster mode will take seconds! listed is one example: stage 659 local mode: 659 Streaming job from [output operation 1, batch time 17:45:50] map at KafkaHelper.scala:35 +details 2016/03/22 17:46:31 11 ms mesos cluster mode: 659 Streaming job from [output operation 1, batch time 18:01:20] map at KafkaHelper.scala:35