spark-streaming

Can't access kafka.serializer.StringDecoder

Submitted by 强颜欢笑 on 2019-12-08 03:47:16

Question: I have added the sbt packages for Kafka and Spark Streaming as follows:

"org.apache.spark" % "spark-streaming_2.10" % "1.6.1",
"org.apache.spark" % "spark-streaming-kafka_2.10" % "1.6.1"

However, when I want to use the Kafka direct stream, I can't access it:

val topics = "CCN_TOPIC,GGSN_TOPIC"
val topicsSet = topics.split(",").toSet
val kafkaParams = Map[String, String]("metadata.broker.list" -> kafkaBrokers)
val messages = org.apache.spark.streaming.kafka.KafkaUtils[String, String, kafka.serializer…
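For reference, the direct-stream call this question is heading toward usually looks like the sketch below, written against spark-streaming-kafka_2.10 1.6.1. The broker list and app name are placeholders, not taken from the question; if kafka.serializer.StringDecoder cannot be resolved at compile time, it usually means the kafka client classes pulled in by spark-streaming-kafka are missing from the compile classpath.

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

object KafkaDirectSketch {
  def main(args: Array[String]): Unit = {
    val kafkaBrokers = "host1:9092,host2:9092"   // placeholder broker list
    val ssc = new StreamingContext(new SparkConf().setAppName("kafka-direct"), Seconds(10))

    val topicsSet = "CCN_TOPIC,GGSN_TOPIC".split(",").toSet
    val kafkaParams = Map[String, String]("metadata.broker.list" -> kafkaBrokers)

    // The decoder classes come from the kafka client jar that spark-streaming-kafka depends on.
    val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topicsSet)

    messages.map(_._2).print()
    ssc.start()
    ssc.awaitTermination()
  }
}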

Updating a global variable periodically in Spark

Submitted by 余生长醉 on 2019-12-08 03:14:55

Question: I'm doing something like pattern matching in a Spark Streaming app. What I want is to update a variable that behaves like a broadcast variable but is mutable. Is there a way to do that? Any advice?

EDIT: Sorry for not being clear. I am doing some CEP work on logs. I need to load the rules from Elasticsearch while the Spark application is running, and I want to apply these rules on the worker side (on each RDD).

Answer 1: The idea here is to write a wrapper over the broadcast variable that gets…
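The answer is cut off above, but one common shape of this pattern is to keep the Broadcast reference on the driver and replace it between batches, so each batch's closure captures the latest value. A minimal sketch; ssc, logs, loadRulesFromEs() and applyRules() are hypothetical placeholders for the question's own streaming context, DStream, Elasticsearch loader and rule engine.

import org.apache.spark.broadcast.Broadcast

var rulesBc: Broadcast[Seq[String]] = ssc.sparkContext.broadcast(loadRulesFromEs())

logs.foreachRDD { rdd =>
  // Runs on the driver once per batch: reload the rules, rebroadcast, drop the old copy.
  val old = rulesBc
  rulesBc = ssc.sparkContext.broadcast(loadRulesFromEs())
  old.unpersist()

  val rules = rulesBc   // this reference is what the worker-side closure captures
  rdd.foreach(event => applyRules(rules.value, event))
}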

Spark Streaming - Best way to Split Input Stream based on filter Param

Submitted by こ雲淡風輕ζ on 2019-12-08 02:38:05

Question: I am currently trying to build a monitoring solution: some data is written to Kafka, and I read this data with Spark Streaming and process it. To preprocess the data for machine learning and anomaly detection, I would like to split the stream based on some filter parameters. So far I have learned that DStreams themselves cannot be split into several streams. The problem I am mainly facing is that many algorithms (like KMeans) only take continuous data and not discrete data like e.g.…
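Although a DStream cannot be split in place, you can derive several independently filtered DStreams from the same parent, which covers the usual "split by filter parameter" case. A sketch with made-up names: Metric, parseMetric and messages stand in for whatever the Kafka payload actually contains.

import org.apache.spark.streaming.dstream.DStream

case class Metric(source: String, value: Double)

// `messages` is the raw DStream[String] from Kafka; `parseMetric` is a placeholder parser.
val metrics: DStream[Metric] = messages.map(parseMetric)

val cpuStream    = metrics.filter(_.source == "cpu")
val memoryStream = metrics.filter(_.source == "memory")

// Each filtered stream can now feed its own model, e.g. a StreamingKMeans per source.
cpuStream.foreachRDD(rdd => println(s"cpu events in batch: ${rdd.count()}"))
memoryStream.foreachRDD(rdd => println(s"memory events in batch: ${rdd.count()}"))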

Spark Structured Streaming, multiple queries are not running concurrently

Submitted by 烂漫一生 on 2019-12-08 00:53:53

Question: I slightly modified the example taken from here - https://github.com/apache/spark/blob/v2.2.0/examples/src/main/scala/org/apache/spark/examples/sql/streaming/StructuredNetworkWordCount.scala - and added a second writeStream (sink):

case class MyWriter1() extends ForeachWriter[Row] {
  override def open(partitionId: Long, version: Long): Boolean = true
  override def process(value: Row): Unit = {
    println(s"custom1 - ${value.get(0)}")
  }
  override def close(errorOrNull: Throwable): Unit = ()
}

case…
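The question body is truncated before the second writer, but the usual reason two sinks run one after another in this setup is blocking on the first query's awaitTermination() before the second query is ever started. A sketch of the typical fix; MyWriter2 stands in for the cut-off second writer and wordCounts for the example's streaming DataFrame.

// Start both queries first, then wait on the whole StreamingQueryManager.
val query1 = wordCounts.writeStream.foreach(MyWriter1()).start()
val query2 = wordCounts.writeStream.foreach(MyWriter2()).start()

spark.streams.awaitAnyTermination()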

spark-submit classpath issue with --repositories --packages options

Submitted by 你说的曾经没有我的故事 on 2019-12-08 00:39:11

Question: I'm running Spark in a standalone cluster where the Spark master, worker and submit each run in their own Docker container. When I spark-submit my Java app with the --repositories and --packages options, I can see that it successfully downloads the app's required dependencies. However, the stderr logs on the Spark worker's web UI report a java.lang.ClassNotFoundException: kafka.serializer.StringDecoder. This class is available in one of the dependencies downloaded by spark-submit, but it doesn't look…
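For context, the command shape the question describes is roughly the following; the master URL, repository, class name and jar path are illustrative, not taken from the question. The --packages coordinates are what make spark-submit resolve the kafka client jar that contains kafka.serializer.StringDecoder.

spark-submit \
  --master spark://spark-master:7077 \
  --repositories https://repo1.maven.org/maven2 \
  --packages org.apache.spark:spark-streaming-kafka_2.10:1.6.1 \
  --class com.example.MyStreamingApp \
  /opt/app/my-streaming-app.jar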

How to save spark streaming data in cassandra

Submitted by 北慕城南 on 2019-12-07 22:03:56

Question: Below are the contents of my build.sbt file:

val sparkVersion = "1.6.3"
scalaVersion := "2.10.5"
resolvers += "Spark Packages Repo" at "https://dl.bintray.com/spark-packages/maven"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-streaming" % sparkVersion,
  "org.apache.spark" %% "spark-streaming-kafka" % sparkVersion)
libraryDependencies += "datastax" % "spark-cassandra-connector" % "1.6.3-s_2.10"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.1.0"

Command…
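With those dependencies in place, writing a DStream to Cassandra generally goes through the connector's saveToCassandra helper. A minimal sketch assuming spark-cassandra-connector 1.6.x; the Cassandra host, keyspace, table and column names are placeholders, and messages stands for the DStream[String] read from Kafka.

import com.datastax.spark.connector._
import com.datastax.spark.connector.streaming._
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("kafka-to-cassandra")
  .set("spark.cassandra.connection.host", "127.0.0.1")   // placeholder Cassandra host

val ssc = new StreamingContext(conf, Seconds(10))

// `messages` is the DStream[String] coming out of the Kafka stream (not shown here).
messages
  .map(msg => (msg, System.currentTimeMillis()))
  .saveToCassandra("my_keyspace", "my_table", SomeColumns("message", "ts"))

ssc.start()
ssc.awaitTermination()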

Extract the time stamp from kafka messages in spark streaming?

Submitted by 天涯浪子 on 2019-12-07 19:01:25

Question: I am trying to read from a Kafka source and want to extract the timestamp from each message received, in order to do structured Spark streaming. Kafka version 0.10.0.0, Spark Streaming version 2.0.1.

Answer 1: I'd suggest a couple of things. Suppose you create the stream via the latest Kafka streaming API (Kafka 0.10), e.g. with the dependency:

"org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.0.1"

Then you create a stream, according to the docs above:

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker1:9092…
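Continuing in that direction (the answer is cut off above): with the 0-10 integration each element of the stream is a Kafka ConsumerRecord, which carries the record timestamp directly. A sketch; ssc, topics and kafkaParams are assumed to be set up as in the answer.

import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](topics, kafkaParams))

// ConsumerRecord exposes the broker/producer timestamp alongside key and value.
val withTimestamps = stream.map { record: ConsumerRecord[String, String] =>
  (record.timestamp(), record.value())
}

withTimestamps.print()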

Spark Structured Streaming - Read file from Nested Directories

Submitted by 陌路散爱 on 2019-12-07 18:38:42

Question: I have a client which places CSV files in nested directories as below, and I need to read these files in real time. I am trying to do this using Spark Structured Streaming.

Data:
/user/data/1.csv
/user/data/2.csv
/user/data/3.csv
/user/data/sub1/1_1.csv
/user/data/sub1/1_2.csv
/user/data/sub1/sub2/2_1.csv
/user/data/sub1/sub2/2_2.csv

Code:
val csvDF = spark
  .readStream
  .option("sep", ",")
  .schema(userSchema)   // Schema of the csv files
  .csv("/user/data/")

Any configurations to be added to…
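Two approaches are commonly used here, hedged because availability depends on the Spark version: on Spark 3.0+ the file sources accept a recursiveFileLookup option, while on older versions the usual workaround is one stream per directory depth via glob paths, unioned together. A sketch:

// Option 1 (Spark 3.0+): let the file source walk the directory tree itself.
val csvDF = spark
  .readStream
  .option("sep", ",")
  .option("recursiveFileLookup", "true")   // not available on older Spark versions
  .schema(userSchema)
  .csv("/user/data/")

// Option 2 (older versions): one glob per depth, then union the streaming DataFrames.
val depth0 = spark.readStream.option("sep", ",").schema(userSchema).csv("/user/data/*.csv")
val depth1 = spark.readStream.option("sep", ",").schema(userSchema).csv("/user/data/*/*.csv")
val depth2 = spark.readStream.option("sep", ",").schema(userSchema).csv("/user/data/*/*/*.csv")
val allCsv = depth0.union(depth1).union(depth2)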

Spark Streaming - Calculating stats from key-value pairs grouped by keys

Submitted by 痞子三分冷 on 2019-12-07 16:47:05

Question: Background: I'm using Spark Streaming to stream events from Kafka, which arrive as comma-separated key-value pairs. Here is an example of how events are streamed into my Spark application:

Key1=Value1, Key2=Value2, Key3=Value3, Key4=Value4, responseTime=200
Key1=Value5, Key2=Value6, Key3=Value7, Key4=Value8, responseTime=150
Key1=Value9, Key2=Value10, Key3=Value11, Key4=Value12, responseTime=100

Output: I want to calculate different metrics (avg, count etc.) grouped by different keys in…
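One straightforward way to get per-key, per-batch statistics from events in that shape is to parse each line into a map and reduce by the grouping key. A sketch that counts and averages responseTime per Key1 value; the field names come from the sample above, and messages stands for the raw DStream[String] from Kafka.

// Parse "K=V, K=V, ..." lines into (groupingKey, responseTime) pairs.
val parsed = messages.map { line =>
  val fields = line.split(",").map(_.trim.split("=", 2)).map(a => a(0) -> a(1)).toMap
  (fields("Key1"), fields("responseTime").toDouble)
}

// Per-batch count and average of responseTime for each Key1 value.
val statsByKey = parsed
  .mapValues(rt => (rt, 1L))
  .reduceByKey { case ((sum1, n1), (sum2, n2)) => (sum1 + sum2, n1 + n2) }
  .map { case (key, (sum, n)) => (key, n, sum / n) }

statsByKey.print()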

Spark Structured Streaming File Source Starting Offset

Submitted by ╄→гoц情女王★ on 2019-12-07 16:15:24

Question: Is there a way to specify a starting offset for the Spark Structured Streaming file source? I am trying to stream parquet files from HDFS:

spark.sql("SET spark.sql.streaming.schemaInference=true")

spark.readStream
  .parquet("/tmp/streaming/")
  .writeStream
  .option("checkpointLocation", "/tmp/streaming-test/checkpoint")
  .format("parquet")
  .option("path", "/tmp/parquet-sink")
  .trigger(Trigger.ProcessingTime(1.minutes))
  .start()

As I can see, the first run processes all available files detected in the path,…
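The file source has no offset option comparable to Kafka's startingOffsets, but a couple of documented options shape what the first batches pick up, and the checkpoint location makes later runs resume where the previous one stopped. A sketch of those options; schema is a placeholder for the parquet schema.

val input = spark.readStream
  .schema(schema)                            // or rely on spark.sql.streaming.schemaInference
  .option("maxFilesPerTrigger", "100")       // cap how many files each micro-batch reads
  .option("latestFirst", "true")             // process the newest files first
  .parquet("/tmp/streaming/")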