spark-streaming

ML model update in spark streaming

Submitted by 戏子无情 on 2019-12-11 00:56:37
Question: I have persisted a machine learning model in HDFS via a Spark batch job and I am consuming it in my Spark Streaming application. Basically, the ML model is broadcast to all executors from the Spark driver. Can someone suggest how I can update the model in real time without stopping the Spark Streaming job? Basically, a new ML model will get created as and when more data points are available, but I have no idea how the NEW model will need to be sent to the Spark executors. Request to post some sample
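One pattern often suggested for this situation is sketched below; it is not the original poster's code, and the model type, HDFS path, and refresh trigger are assumptions. The idea is to keep the broadcast handle in a driver-side holder that can be rebuilt when a new model lands, and to look it up inside foreachRDD so the next batch picks up the refreshed broadcast without restarting the job.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.mllib.classification.LogisticRegressionModel
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.streaming.dstream.DStream

// Hypothetical driver-side holder: keeps the latest broadcast model and can
// re-broadcast it when a newer model file appears in HDFS.
object ModelHolder {
  @volatile private var instance: Broadcast[LogisticRegressionModel] = _

  def get(sc: SparkContext, path: String): Broadcast[LogisticRegressionModel] = {
    if (instance == null) refresh(sc, path)
    instance
  }

  // Call this (e.g. on a schedule, or when the batch job signals completion)
  // to load the new model and replace the old broadcast.
  def refresh(sc: SparkContext, path: String): Unit = synchronized {
    if (instance != null) instance.unpersist(blocking = false)
    instance = sc.broadcast(LogisticRegressionModel.load(sc, path))
  }
}

def score(features: DStream[Vector], modelPath: String): Unit = {
  features.foreachRDD { rdd =>
    // This block runs on the driver for every batch, so a refreshed
    // broadcast is picked up by the next batch automatically.
    val model = ModelHolder.get(rdd.sparkContext, modelPath)
    rdd.map(v => model.value.predict(v))
       .foreach(prediction => println(prediction)) // placeholder action
  }
}
```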

Filter partial duplicates with mapWithState Spark Streaming

Submitted by 风格不统一 on 2019-12-11 00:27:50
Question: We have a DStream, such as val ssc = new StreamingContext(sc, Seconds(1)) val kS = KafkaUtils.createDirectStream[String, TMapRecord]( ssc, PreferConsistent, Subscribe[String, TMapRecord](topicsSetT, kafkaParamsInT)). mapPartitions(part => { part.map(_.value()) }). mapPartitions(part1 => { part1.map(c => { TMsg(1, c.field1, c.field2, //And others c.startTimeSeconds ) }) }) So each RDD has a bunch of TMsg objects with some of the (technical) key fields I can use to deduplicate the DStream.
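A sketch of the mapWithState approach follows; the composite key built from field1/field2, the 30-minute timeout, and the simplified TMsg definition are all assumptions, not taken from the question. The state keeps a Boolean per key and a message is emitted only the first time its key shows up.

```scala
import org.apache.spark.streaming.{Minutes, State, StateSpec}
import org.apache.spark.streaming.dstream.DStream

case class TMsg(id: Int, field1: String, field2: String, startTimeSeconds: Long)

// Emit a message only the first time its key is seen; idle state entries
// expire after the timeout so the state store does not grow forever.
// Note: mapWithState requires ssc.checkpoint(...) to be configured.
val dedupSpec = StateSpec.function(
  (key: String, msg: Option[TMsg], state: State[Boolean]) => {
    if (state.exists) {
      Option.empty[TMsg]     // duplicate within the timeout window: drop it
    } else {
      state.update(true)     // remember this key
      msg                    // first occurrence: pass it through
    }
  }).timeout(Minutes(30))

// msgs is the DStream[TMsg] built from the Kafka direct stream above.
def dedup(msgs: DStream[TMsg]): DStream[TMsg] =
  msgs.map(m => (s"${m.field1}_${m.field2}", m))   // assumed dedup key
      .mapWithState(dedupSpec)
      .flatMap(_.toSeq)
```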

Is it possible to remove files from Spark Streaming folder?

Submitted by 别等时光非礼了梦想. on 2019-12-10 23:58:59
Question: Spark 2.1; an ETL process converts files from source systems into Parquet and puts the small Parquet files in folder1. Spark streaming on folder1 is working OK, but the Parquet files in folder1 are too small for HDFS. We have to merge the small Parquet files into bigger ones, but when I try to remove files from folder1, the Spark streaming process raises an exception: 17/07/26 17:16:23 ERROR StreamExecution: Query [id = f29783ea-bdfb-4b59-a6f6-b77f79509a5a, runId = cbcce2b2-7d7b-4e31-a15a-7efed420f974] terminated with error java
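One workaround, offered here as an assumption rather than a confirmed answer, is to leave folder1 alone while the query is running and instead compact the small files with a separate batch job into a second folder; the folder paths and target file count below are placeholders.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("compact-folder1").getOrCreate()

// Batch compaction: read the small Parquet files and rewrite them as a
// handful of larger files in a separate folder (folder2 is a placeholder),
// so nothing the running streaming query has already tracked gets deleted.
spark.read.parquet("hdfs:///data/folder1")
  .coalesce(4)                                   // target file count is an assumption
  .write.mode("append").parquet("hdfs:///data/folder2")
```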

How to add jar using HiveContext in the spark job

Submitted by 孤街醉人 on 2019-12-10 23:46:41
Question: I am trying to add the JSONSerDe jar file in order to access the JSON data and load it into a Hive table from the Spark job. My code is shown below: SparkConf sparkConf = new SparkConf().setAppName("KafkaStreamToHbase"); JavaSparkContext sc = new JavaSparkContext(sparkConf); JavaStreamingContext jssc = new JavaStreamingContext(sc, Durations.seconds(10)); final SQLContext sqlContext = new SQLContext(sc); final HiveContext hiveContext = new HiveContext(sc); hiveContext.sql("ADD JAR hdfs:/
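For illustration, here is a minimal Scala sketch of one way to make a SerDe jar visible to HiveContext queries; the jar path and table name are placeholders, and shipping the jar with spark-submit --jars is a general Spark mechanism rather than the poster's confirmed fix.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("KafkaStreamToHbase"))
val hiveContext = new HiveContext(sc)

// Register the SerDe jar for the Hive session (path is a placeholder).
// The same jar can also be passed to spark-submit via --jars so that the
// executors have it on their classpath as well.
hiveContext.sql("ADD JAR hdfs:///user/libs/json-serde-with-dependencies.jar")
hiveContext.sql("SELECT * FROM json_table LIMIT 10").show()
```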

Issue putting Spark Streaming data into HBase

Submitted by 岁酱吖の on 2019-12-10 23:13:17
Question: I am a beginner in this field, so I cannot get a sense of it... HBase version: 0.98.24-hadoop2. Spark version: 2.1.0. The following code tries to put data received from a Spark Streaming Kafka producer into HBase. The Kafka input data format is like this: Line1,TAG1,123 Line1,TAG2,134 The Spark Streaming process splits each received line by the delimiter ',' and then puts the data into HBase. However, my application hits an error when it calls the htable.put() method. Can anyone help explain why the code below is throwing an error?
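For reference, a sketch of the usual foreachRDD/foreachPartition shape for HBase writes is below; the table name, column family, and the 0.98-era HTable API are assumptions based on the versions mentioned. Creating the client inside foreachPartition avoids shipping a non-serializable connection from the driver, which is a frequent cause of failures around htable.put().

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{HTable, Put}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.streaming.dstream.DStream

// Sketch only: "test_table" and column family "cf" are placeholders.
def writeToHBase(lines: DStream[String]): Unit = {
  lines.foreachRDD { rdd =>
    rdd.foreachPartition { records =>
      // Create the (non-serializable) HBase client on the executor,
      // once per partition, rather than on the driver.
      val conf = HBaseConfiguration.create()
      val table = new HTable(conf, "test_table")
      records.foreach { line =>
        val Array(rowKey, tag, value) = line.split(",")
        val put = new Put(Bytes.toBytes(rowKey))
        put.add(Bytes.toBytes("cf"), Bytes.toBytes(tag), Bytes.toBytes(value))
        table.put(put)
      }
      table.close()
    }
  }
}
```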

Opening two KafkaStreams after each other with different StreamingContext

Submitted by 限于喜欢 on 2019-12-10 22:23:54
Question: I am currently trying to implement a two-stage process in Spark Streaming. First I open a Kafka stream, read everything that is already in the topic by using auto.offset.reset=earliest, and train my model on it. I use a stream for that as I could not find out how to do it without opening a stream first (Spark - Get earliest and latest offset of Kafka without opening stream). As I have not discovered a way to stop a stream without stopping the whole StreamingContext, I stop the context after
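One alternative, offered as an assumption rather than the poster's solution, is to read the historical data as a plain batch RDD with KafkaUtils.createRDD, train on it, and only then start a single StreamingContext for the live data; the broker address, topic, partition, and offsets below are placeholders.

```scala
import scala.collection.JavaConverters._
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "broker:9092",              // placeholder
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "two-stage-job")            // placeholder

val sc = new SparkContext(new SparkConf().setAppName("two-stage"))

// Stage 1: read what is already in the topic as a batch RDD and train on it;
// no stream has to be opened or stopped. The offset range is a placeholder
// and would normally come from the consumer's beginning/end offsets.
val historyRanges = Array(OffsetRange("mytopic", 0, 0L, 1000L))
val history = KafkaUtils.createRDD[String, String](
  sc, kafkaParams.asJava, historyRanges, PreferConsistent)
val trainingData = history.map(_.value())
// ... train the model on trainingData ...

// Stage 2: start the one and only StreamingContext for live data.
val ssc = new StreamingContext(sc, Seconds(10))
val live = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Set("mytopic"), kafkaParams))
live.map(_.value()).print()
ssc.start()
ssc.awaitTermination()
```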

Why is the first member of the tuple received from Kafka using Spark direct stream null?

Submitted by 时光总嘲笑我的痴心妄想 on 2019-12-10 21:43:17
Question: When reading messages from Kafka using KafkaUtils.createDirectStream, the v1._1 member of the Tuple2 is null: KafkaUtils.createDirectStream( streamingContext, String.class, String.class, StringDecoder.class, StringDecoder.class, kafkaParams, topicsSet ).map(new Function<Tuple2<String,String>, String>() { @Override public String call(Tuple2<String, String> v1) throws Exception { System.out.println(v1._1); return null; } }); while the _2 member contains the message itself that was passed to
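For context, with this (String key, String value) direct stream the _1 element is the Kafka message key, which is null whenever the producer sent records without one. A minimal Scala producer sketch that does set a key is below; the broker address, topic, key, and value are placeholders.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "broker:9092")   // placeholder
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

val producer = new KafkaProducer[String, String](props)
// The three-argument ProducerRecord sets an explicit key, so the
// consumer-side Tuple2._1 is no longer null.
producer.send(new ProducerRecord[String, String]("mytopic", "my-key", "my-message"))
producer.close()
```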

Error when decoding the Proto Buf messages in Spark Streaming , using scalapb

Submitted by 丶灬走出姿态 on 2019-12-10 21:13:29
Question: This is a Spark Streaming app that consumes Kafka messages encoded in Protobuf, using the scalapb library. I am getting the following error. Please help. com.google.protobuf.InvalidProtocolBufferException: While parsing a protocol message, the input ended unexpectedly in the middle of a field. This could mean either that the input has been truncated or that an embedded message misreported its own length. at com.google.protobuf.InvalidProtocolBufferException.truncatedMessage
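One common cause of this exception, stated here as an assumption rather than a confirmed diagnosis, is deserializing the Kafka value as a String, which mangles the binary Protobuf payload. The sketch below keeps the value as raw bytes (using the 0.8-style direct stream purely for illustration) and hands them to a scalapb-generated companion's parseFrom; the parse function is passed in so the snippet stays self-contained, and MyEvent in the usage comment is a hypothetical generated class.

```scala
import scala.reflect.ClassTag
import kafka.serializer.{DefaultDecoder, StringDecoder}
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.kafka.KafkaUtils

// Keep the value as Array[Byte]; decoding Protobuf bytes as a String
// typically ends in InvalidProtocolBufferException.
def decode[T: ClassTag](ssc: StreamingContext,
                        kafkaParams: Map[String, String],
                        topics: Set[String],
                        parse: Array[Byte] => T): DStream[T] = {
  val stream = KafkaUtils.createDirectStream[
    String, Array[Byte], StringDecoder, DefaultDecoder](ssc, kafkaParams, topics)
  stream.map { case (_, bytes) => parse(bytes) }
}

// Usage with a scalapb-generated class (MyEvent is a placeholder):
//   val events = decode(ssc, kafkaParams, topics, MyEvent.parseFrom)
```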

Spark Streaming application fails with KafkaException: String exceeds the maximum size or with IllegalArgumentException

Submitted by 一世执手 on 2019-12-10 19:46:15
Question: TL;DR: My very simple Spark Streaming application fails in the driver with "KafkaException: String exceeds the maximum size". I see the same exception in the executor, but I also found, somewhere down the executor's logs, an IllegalArgumentException with no other information in it. Full problem: I'm using Spark Streaming to read some messages from a Kafka topic. This is what I'm doing: val conf = new SparkConf().setAppName("testName") val streamingContext = new StreamingContext(new
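As an assumption rather than a confirmed diagnosis: this particular message usually appears when the client ends up parsing bytes that are not a valid Kafka protocol response, for example a plaintext connection to an SSL/SASL listener or a mismatched client/broker version, so the consumer settings are worth double-checking. A minimal parameter map with the relevant knobs is sketched below; the broker address, protocol, and group id are placeholders.

```scala
import org.apache.kafka.common.serialization.StringDeserializer

// Settings worth verifying: the bootstrap address points at a real Kafka
// listener, and security.protocol matches what that listener expects.
val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "broker:9092",       // placeholder
  "security.protocol"  -> "PLAINTEXT",         // must match the listener
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "testName")
```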

Spark streaming + json4s-jackson dependency problems

Submitted by 我怕爱的太早我们不能终老 on 2019-12-10 18:56:05
Question: I am unable to use json4s-jackson 3.2.11 within my Spark 1.4.1 Streaming application. Thinking that it was the existing dependency within the spark-core project that is causing the problem, as explained here -> Is it possible to use json4s 3.2.11 with Spark 1.3.0?, I have built Spark from source with an adjusted core/pom.xml. I have changed the reference from json4s-jackson_2.10:3.2.10 to 3.2.11, as the 3.2.10 version does not support extracting to implicit types. I have replaced the source jars
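An alternative to rebuilding Spark, offered as an assumption rather than the poster's approach, is to shade json4s inside the application's fat jar so the 3.2.11 classes cannot collide with the 3.2.10 ones on Spark's classpath. A sketch of the relevant build.sbt fragment is below; it assumes sbt-assembly 0.14+ is already enabled for the project.

```scala
// build.sbt fragment (sbt-assembly plugin assumed)
libraryDependencies ++= Seq(
  "org.json4s"       %% "json4s-jackson"  % "3.2.11",
  "org.apache.spark" %% "spark-streaming" % "1.4.1" % "provided"
)

// Relocate the application's json4s classes into a private package so the
// json4s 3.2.10 bundled with Spark never shadows them at runtime.
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("org.json4s.**" -> "shaded.json4s.@1").inAll
)
```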