spark-streaming

SparkStreaming: error in fileStream()

若如初见. posted on 2019-12-07 16:03:57
Question: I am trying to implement a Spark Streaming application in Scala. I want to use the fileStream() method to process newly arrived files as well as older files already present in the Hadoop directory. I have followed the fileStream() implementations from the following two Stack Overflow threads: Scala Spark streaming fileStream and spark streaming fileStream. I am using fileStream() as follows: val linesRDD = ssc.fileStream[LongWritable, Text, TextInputFormat](inputDirectory, (t: org.apache.hadoop.fs.Path) => true,
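A minimal sketch of the call pattern in question, assuming the goal is to pick up files already present in the directory as well as new ones: the third argument, newFilesOnly, is set to false (the directory path and batch interval below are placeholders).

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("FileStreamSketch").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(10))

val inputDirectory = "hdfs:///user/example/input"   // placeholder path

// filter accepts every path; newFilesOnly = false so files already present are processed too
val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](
  inputDirectory,
  (path: org.apache.hadoop.fs.Path) => true,
  newFilesOnly = false
).map { case (_, text) => text.toString }

lines.print()
ssc.start()
ssc.awaitTermination()
```

Note that even with newFilesOnly = false, FileInputDStream only considers files whose modification time falls within its remember window (spark.streaming.fileStream.minRememberDuration, 60 seconds by default), which is a common reason older files appear to be skipped.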

Spark: How to split an RDD[T] into Seq[RDD[T]] and preserve the ordering

爷,独闯天下 posted on 2019-12-07 13:01:24
Question: How can I effectively split up an RDD[T] into a Seq[RDD[T]] / Iterable[RDD[T]] with n elements and preserve the original ordering? I would like to be able to write something like RDD(1, 2, 3, 4, 5, 6, 7, 8, 9).split(3), which should result in something like Seq(RDD(1, 2, 3), RDD(4, 5, 6), RDD(7, 8, 9)). Does Spark provide such a function? If not, what is a performant way to achieve this? val parts = rdd.length / n val rdds = rdd.zipWithIndex().map{ case (t, i) => (i - (i % parts), t)}
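Spark has no built-in split of this kind; a minimal sketch of one way to do it (splitRDD and the chunk arithmetic are illustrative, not a library function), indexing the RDD once and filtering it into n roughly equal, order-preserving pieces:

```scala
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// hypothetical helper: split an RDD into n order-preserving chunks
def splitRDD[T: ClassTag](rdd: RDD[T], n: Int): Seq[RDD[T]] = {
  val indexed = rdd.zipWithIndex().cache()   // (element, position), computed once
  val count = indexed.count()
  val chunkSize = math.max(1L, math.ceil(count.toDouble / n).toLong)
  (0 until n).map { i =>
    indexed
      .filter { case (_, idx) => idx / chunkSize == i }  // keep the i-th contiguous slice
      .map(_._1)
  }
}
```

Each returned RDD triggers its own filter job over the cached parent, so this trades one pass for n passes; for small n that is usually acceptable.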

Yarn : Automatic clearing of filecache & usercache

╄→гoц情女王★ posted on 2019-12-07 12:06:39
Question: We are running a Spark Streaming job with YARN as the resource manager and have noticed that these two directories are filling up on the data nodes; we run out of space after only a couple of minutes: /tmp/hadoop/data/nm-local-dir/filecache and /tmp/hadoop/data/nm-local-dir/usercache. These directories are not being cleared automatically. From my research I found that this property needs to be set: yarn.nodemanager.localizer.cache.cleanup.interval-ms. Even after setting it up, it's
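For reference, the NodeManager settings that usually govern this are the cleanup interval mentioned above plus the cache size target; a hedged yarn-site.xml sketch with example values (not recommendations). Note that the cleanup service only evicts entries once the cache exceeds the target size, and the usercache of a still-running application is generally not removed until that application finishes.

```xml
<!-- yarn-site.xml: example values, adjust to the node's available disk -->
<property>
  <name>yarn.nodemanager.localizer.cache.cleanup.interval-ms</name>
  <value>300000</value>  <!-- run the deletion service every 5 minutes -->
</property>
<property>
  <name>yarn.nodemanager.localizer.cache.target-size-mb</name>
  <value>2048</value>    <!-- start evicting once the local cache exceeds ~2 GB -->
</property>
```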

How to convert streaming Dataset to DStream?

微笑、不失礼 posted on 2019-12-07 12:03:22
Question: Is it possible to convert a streaming o.a.s.sql.Dataset to a DStream? If so, how? I know how to convert it to an RDD, but this is in a streaming context. Answer 1: It is not possible. Structured Streaming and legacy Spark Streaming (DStreams) use completely different semantics and are not compatible with each other, so: a DStream cannot be converted to a streaming Dataset, and a streaming Dataset cannot be converted to a DStream. Answer 2: It could be possible (in some use cases). That question really begs another:
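Since no direct conversion exists, one commonly suggested workaround (an assumption here, not part of the quoted answers) is to stay in Structured Streaming and get per-micro-batch access to an ordinary Dataset via foreachBatch (Spark 2.4+); a minimal sketch using the built-in rate source as a stand-in:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder()
  .appName("ForeachBatchSketch")
  .master("local[*]")
  .getOrCreate()

val streamingDf = spark.readStream
  .format("rate")   // built-in test source, stands in for the real streaming source
  .load()

// batchDf is an ordinary (non-streaming) DataFrame for this micro-batch;
// batchDf.rdd is available here if RDD-style code is required
val writeBatch: (DataFrame, Long) => Unit = (batchDf, batchId) => {
  println(s"batch $batchId contains ${batchDf.count()} rows")
}

val query = streamingDf.writeStream
  .foreachBatch(writeBatch)
  .start()

query.awaitTermination()
```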

Spark streaming mapWithState timeout delayed?

北城余情 posted on 2019-12-07 09:23:08
Question: I expected the new mapWithState API for Spark 1.6+ to remove timed-out objects near-immediately, but there is a delay. I'm testing the API with the adapted version of JavaStatefulNetworkWordCount below: SparkConf sparkConf = new SparkConf() .setAppName("JavaStatefulNetworkWordCount") .setMaster("local[*]"); JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, Durations.seconds(1)); ssc.checkpoint("./tmp"); StateSpec<String, Integer, Integer, Tuple2<String, Integer>>
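The delay is expected: timed-out state is only dropped when Spark runs its periodic state cleanup, which is tied to the checkpoint interval (roughly every ten batches by default), not to the exact moment the timeout expires. A minimal Scala sketch of the same pattern as the Java code in the question (the word-count names are illustrative):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}

val conf = new SparkConf().setAppName("MapWithStateTimeoutSketch").setMaster("local[*]")
val ssc = new StreamingContext(conf, Seconds(1))
ssc.checkpoint("./tmp")

// (word, newCount, runningState) => (word, runningTotal)
def trackState(word: String, one: Option[Int], state: State[Int]): (String, Int) = {
  if (state.isTimingOut()) {
    // called once when the key is being expired; the state can no longer be updated here
    (word, state.get())
  } else {
    val sum = one.getOrElse(0) + state.getOption().getOrElse(0)
    state.update(sum)
    (word, sum)
  }
}

val spec = StateSpec.function(trackState _).timeout(Seconds(5))

val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" ")).map((_, 1))
words.mapWithState(spec).print()

ssc.start()
ssc.awaitTermination()
```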

How to write rows asynchronously in Spark Streaming application to speed up batch execution?

爷,独闯天下 posted on 2019-12-07 09:21:17
Question: I have a Spark job where I need to write the output of a SQL query every micro-batch. The write is an expensive operation performance-wise and is causing the batch execution time to exceed the batch interval. I am looking for ways to improve the performance of the write. Is doing the write action asynchronously in a separate thread, as shown below, a good option? Would this cause any side effects, given that Spark itself executes in a distributed manner? Are there other/better ways of speeding up the write? //
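A minimal sketch of the asynchronous-write idea the question describes (names and paths are illustrative): the save is launched in a Future from foreachRDD so the driver does not block, and spark.streaming.concurrentJobs is raised so the write job can overlap the next batch.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.Future

val conf = new SparkConf()
  .setAppName("AsyncWriteSketch")
  .setMaster("local[*]")
  // allow the asynchronous write job to overlap the processing of the next batch
  .set("spark.streaming.concurrentJobs", "2")
val ssc = new StreamingContext(conf, Seconds(60))

val lines = ssc.socketTextStream("localhost", 9999)   // stand-in for the real source

lines.foreachRDD { (rdd, time) =>
  val toWrite = rdd.cache()   // materialize once so the background job reuses it
  Future {
    // the expensive save runs as a separate Spark job on a driver-side thread,
    // so foreachRDD returns immediately and the next batch can be scheduled
    toWrite.saveAsTextFile(s"/tmp/output/batch-${time.milliseconds}")   // placeholder path
    toWrite.unpersist()
  }
}

ssc.start()
ssc.awaitTermination()
```

The usual caveats apply: a failure inside the Future does not fail the batch, output ordering across batches is no longer guaranteed, and if writes consistently take longer than the batch interval the jobs will still pile up.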

Spark clean up shuffle spilled to disk

纵饮孤独 posted on 2019-12-07 09:18:35
Question: I have a looping operation which generates some RDDs, does a repartition, then an aggregateByKey operation. After the loop runs once, it computes a final RDD, which is cached and checkpointed, and is also used as the initial RDD for the next loop. These RDDs are quite large and generate lots of intermediate shuffle blocks before arriving at the final RDD for each iteration. I am compressing my shuffles and allowing shuffles to spill to disk. I notice on my worker machines that my working directory
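One commonly suggested mitigation (an assumption here, not taken from the question) is to make sure each iteration drops its reference to the previous RDD, so the driver-side ContextCleaner can delete the shuffle files that RDD produced, helped along by periodic driver GC; a rough sketch:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("ShuffleCleanupSketch")
  .set("spark.shuffle.compress", "true")
  // trigger driver GC periodically so unreferenced shuffles get cleaned up sooner
  .set("spark.cleaner.periodicGC.interval", "10min")
val sc = new SparkContext(conf)
sc.setCheckpointDir("/tmp/checkpoints")   // placeholder path

var current = sc.parallelize(1 to 1000000).map(i => (i % 100, i.toLong))
for (_ <- 1 to 10) {
  val next = current
    .repartition(200)
    .aggregateByKey(0L)(_ + _, _ + _)
    .cache()
  next.checkpoint()     // cuts lineage so earlier shuffle output is no longer needed
  next.count()          // materialize before dropping the old reference
  current.unpersist()   // let ContextCleaner reclaim the previous iteration's blocks
  current = next
}
```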

How to give dependent jars to spark submit in cluster mode

与世无争的帅哥 posted on 2019-12-07 08:15:33
Question: I am running Spark using cluster mode for deployment. Below is the command: JARS=$JARS_HOME/amqp-client-3.5.3.jar,$JARS_HOME/nscala-time_2.10-2.0.0.jar,\ $JARS_HOME/rabbitmq-0.1.0-RELEASE.jar,\ $JARS_HOME/kafka_2.10-0.8.2.1.jar,$JARS_HOME/kafka-clients-0.8.2.1.jar,\ $JARS_HOME/spark-streaming-kafka_2.10-1.4.1.jar,\ $JARS_HOME/zkclient-0.3.jar,$JARS_HOME/protobuf-java-2.4.0a.jar dse spark-submit -v --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \ --executor-memory 512M \
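For reference, in cluster mode the dependency jars have to be distributed with the application (or already be present on every node), typically by passing the comma-separated list to --jars; a hedged sketch of how the command above might continue (the main class and application jar are placeholders):

```bash
# --jars takes a comma-separated list; the jars are shipped to the cluster and
# added to the driver and executor classpaths
dse spark-submit -v \
  --deploy-mode cluster \
  --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
  --executor-memory 512M \
  --jars "$JARS" \
  --class com.example.MyStreamingApp \
  my-streaming-app.jar
```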

How to avoid query preparation (parsing, planning and optimizations) every time the query is executed?

可紊 posted on 2019-12-07 07:03:34
Question: In our Spark Streaming app, with 60-second batches, we create a temp table over a DataFrame, then run about 80 queries against it like: sparkSession.sql("select ... from temp_view group by ...") But given that these are fairly heavy queries with about 300 summed columns, it would be nice if we didn't have to analyze the SQL and generate a query plan for every micro-batch. Isn't there a way to generate, cache, and reuse a query plan? Even saving just 50 ms per query would save us about 4 s per batch. We
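There is no public API for caching a fully prepared physical plan across micro-batches, but one mitigation (an assumption here, not from the thread) is to build the heavy aggregation expressions once with the DataFrame API and reuse them each batch, removing at least the repeated SQL parsing; a sketch with made-up column names:

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.sum

// built once at startup (column names are invented for the example),
// assuming the input schema does not change between micro-batches
val summedColumns: Seq[Column] =
  (1 to 300).map(i => sum(s"col_$i").as(s"col_${i}_sum"))

// called per micro-batch: the 300 aggregation expressions are not re-parsed from SQL text
def runSummaryQuery(batchDf: DataFrame): DataFrame =
  batchDf.groupBy("group_key").agg(summedColumns.head, summedColumns.tail: _*)
```

Catalyst still analyzes and optimizes the resulting plan against each batch's data, so this only removes the parsing and expression-construction overhead.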

How do you setup multiple Spark Streaming jobs with different batch durations?

邮差的信 posted on 2019-12-07 06:30:28
Question: We are in the beginning phases of transforming the current data architecture of a large enterprise, and I am currently building a Spark Streaming ETL framework in which we would connect all of our sources to destinations (sources/destinations could be Kafka topics, Flume, HDFS, etc.) through transformations. This would look something like: SparkStreamingEtlManager.addEtl(Source, Transformation*, Destination) SparkStreamingEtlManager.streamEtl() streamingContext.start() The assumption is that,
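For context, a single StreamingContext (and therefore a single Spark application) has exactly one batch duration. A common compromise (an assumption here, not taken from the question) is to choose the smallest interval any pipeline needs and approximate slower cadences with window(); a minimal sketch with stand-in socket sources:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("MultiIntervalSketch").setMaster("local[*]")
val ssc = new StreamingContext(conf, Seconds(5))   // smallest interval any pipeline needs

val fastSource = ssc.socketTextStream("localhost", 9999)   // stand-in sources
val slowSource = ssc.socketTextStream("localhost", 9998)

// the "fast" pipeline runs on every 5-second batch
fastSource.foreachRDD(rdd => println(s"fast pipeline: ${rdd.count()} records"))

// the "slow" pipeline sees 60 seconds of data at a time, emitted every 60 seconds
slowSource.window(Seconds(60), Seconds(60))
  .foreachRDD(rdd => println(s"slow pipeline: ${rdd.count()} records"))

ssc.start()
ssc.awaitTermination()
```

Pipelines that genuinely need independent batch durations generally have to run as separate Spark applications, each with its own StreamingContext.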