spark-streaming

Can't access kafka.serializer.StringDecoder

Submitted by 强颜欢笑 on 2019-12-08 03:47:16

Question: I have added the sbt packages for Kafka and Spark Streaming as follows:

"org.apache.spark" % "spark-streaming_2.10" % "1.6.1",
"org.apache.spark" % "spark-streaming-kafka_2.10" % "1.6.1"

However, when I want to use the Kafka direct stream, I can't access it:

val topics = "CCN_TOPIC,GGSN_TOPIC"
val topicsSet = topics.split(",").toSet
val kafkaParams = Map[String, String]("metadata.broker.list" -> kafkaBrokers)
val messages = org.apache.spark.streaming.kafka.KafkaUtils[String, String, kafka.serializer…
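For reference, the direct-stream call this question is heading toward usually looks like the sketch below, written against spark-streaming-kafka_2.10 1.6.1. The broker list and app name are placeholders, not taken from the question; if kafka.serializer.StringDecoder cannot be resolved at compile time, it usually means the kafka client classes pulled in by spark-streaming-kafka are missing from the compile classpath.

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

object KafkaDirectSketch {
  def main(args: Array[String]): Unit = {
    val kafkaBrokers = "host1:9092,host2:9092"   // placeholder broker list
    val ssc = new StreamingContext(new SparkConf().setAppName("kafka-direct"), Seconds(10))

    val topicsSet = "CCN_TOPIC,GGSN_TOPIC".split(",").toSet
    val kafkaParams = Map[String, String]("metadata.broker.list" -> kafkaBrokers)

    // The decoder classes come from the kafka client jar that spark-streaming-kafka depends on.
    val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topicsSet)

    messages.map(_._2).print()
    ssc.start()
    ssc.awaitTermination()
  }
}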

Updating a global variable periodically in Spark

Submitted by 余生长醉 on 2019-12-08 03:14:55

Question: I'm doing something like pattern matching in a Spark Streaming app. What I want is to update a variable that behaves like a broadcast variable but is mutable. Is there a way to do that? Any advice?

EDIT: Sorry for not being clear. I am doing some CEP work on logs. I need to load the rules from Elasticsearch while the Spark application is running, and I want to apply these rules on the worker side (on each RDD).

Answer 1: The idea here is to write a wrapper over the broadcast variable that gets…
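The answer is cut off above, but one common shape of this pattern is to keep the Broadcast reference on the driver and replace it between batches, so each batch's closure captures the latest value. A minimal sketch; ssc, logs, loadRulesFromEs() and applyRules() are hypothetical placeholders for the question's own streaming context, DStream, Elasticsearch loader and rule engine.

import org.apache.spark.broadcast.Broadcast

var rulesBc: Broadcast[Seq[String]] = ssc.sparkContext.broadcast(loadRulesFromEs())

logs.foreachRDD { rdd =>
  // Runs on the driver once per batch: reload the rules, rebroadcast, drop the old copy.
  val old = rulesBc
  rulesBc = ssc.sparkContext.broadcast(loadRulesFromEs())
  old.unpersist()

  val rules = rulesBc   // this reference is what the worker-side closure captures
  rdd.foreach(event => applyRules(rules.value, event))
}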

Spark Streaming - Best way to Split Input Stream based on filter Param

Submitted by こ雲淡風輕ζ on 2019-12-08 02:38:05

Question: I am currently trying to build a monitoring solution: some data is written to Kafka, and I read this data with Spark Streaming and process it. To preprocess the data for machine learning and anomaly detection, I would like to split the stream based on some filter parameters. So far I have learned that DStreams themselves cannot be split into several streams. The problem I am mainly facing is that many algorithms (like KMeans) only take continuous data and not discrete data like e.g.…
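Although a DStream cannot be split in place, you can derive several independently filtered DStreams from the same parent, which covers the usual "split by filter parameter" case. A sketch with made-up names: Metric, parseMetric and messages stand in for whatever the Kafka payload actually contains.

import org.apache.spark.streaming.dstream.DStream

case class Metric(source: String, value: Double)

// `messages` is the raw DStream[String] from Kafka; `parseMetric` is a placeholder parser.
val metrics: DStream[Metric] = messages.map(parseMetric)

val cpuStream    = metrics.filter(_.source == "cpu")
val memoryStream = metrics.filter(_.source == "memory")

// Each filtered stream can now feed its own model, e.g. a StreamingKMeans per source.
cpuStream.foreachRDD(rdd => println(s"cpu events in batch: ${rdd.count()}"))
memoryStream.foreachRDD(rdd => println(s"memory events in batch: ${rdd.count()}"))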

Spark Structured Streaming, multiple queries are not running concurrently

Submitted by 烂漫一生 on 2019-12-08 00:53:53

Question: I slightly modified the example taken from here - https://github.com/apache/spark/blob/v2.2.0/examples/src/main/scala/org/apache/spark/examples/sql/streaming/StructuredNetworkWordCount.scala - and added a second writeStream (sink):

case class MyWriter1() extends ForeachWriter[Row] {
  override def open(partitionId: Long, version: Long): Boolean = true
  override def process(value: Row): Unit = {
    println(s"custom1 - ${value.get(0)}")
  }
  override def close(errorOrNull: Throwable): Unit = ()
}

case…
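The question body is truncated before the second writer, but the usual reason two sinks run one after another in this setup is blocking on the first query's awaitTermination() before the second query is ever started. A sketch of the typical fix; MyWriter2 stands in for the cut-off second writer and wordCounts for the example's streaming DataFrame.

// Start both queries first, then wait on the whole StreamingQueryManager.
val query1 = wordCounts.writeStream.foreach(MyWriter1()).start()
val query2 = wordCounts.writeStream.foreach(MyWriter2()).start()

spark.streams.awaitAnyTermination()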

spark-submit classpath issue with --repositories --packages options

Submitted by 你说的曾经没有我的故事 on 2019-12-08 00:39:11

Question: I'm running Spark in a standalone cluster where the Spark master, worker and submit each run in their own Docker container. When I spark-submit my Java app with the --repositories and --packages options, I can see that it successfully downloads the app's required dependencies. However, the stderr logs on the Spark worker's web UI report a java.lang.ClassNotFoundException: kafka.serializer.StringDecoder. This class is available in one of the dependencies downloaded by spark-submit, but it doesn't look…
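For context, the command shape the question describes is roughly the following; the master URL, repository, class name and jar path are illustrative, not taken from the question. The --packages coordinates are what make spark-submit resolve the kafka client jar that contains kafka.serializer.StringDecoder.

spark-submit \
  --master spark://spark-master:7077 \
  --repositories https://repo1.maven.org/maven2 \
  --packages org.apache.spark:spark-streaming-kafka_2.10:1.6.1 \
  --class com.example.MyStreamingApp \
  /opt/app/my-streaming-app.jar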

How to save spark streaming data in cassandra

Submitted by 北慕城南 on 2019-12-07 22:03:56

Question: Below are the contents of my build.sbt file:

val sparkVersion = "1.6.3"
scalaVersion := "2.10.5"
resolvers += "Spark Packages Repo" at "https://dl.bintray.com/spark-packages/maven"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-streaming" % sparkVersion,
  "org.apache.spark" %% "spark-streaming-kafka" % sparkVersion)
libraryDependencies += "datastax" % "spark-cassandra-connector" % "1.6.3-s_2.10"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "1.1.0"

Command…
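With those dependencies in place, writing a DStream to Cassandra generally goes through the connector's saveToCassandra helper. A minimal sketch assuming spark-cassandra-connector 1.6.x; the Cassandra host, keyspace, table and column names are placeholders, and messages stands for the DStream[String] read from Kafka.

import com.datastax.spark.connector._
import com.datastax.spark.connector.streaming._
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("kafka-to-cassandra")
  .set("spark.cassandra.connection.host", "127.0.0.1")   // placeholder Cassandra host

val ssc = new StreamingContext(conf, Seconds(10))

// `messages` is the DStream[String] coming out of the Kafka stream (not shown here).
messages
  .map(msg => (msg, System.currentTimeMillis()))
  .saveToCassandra("my_keyspace", "my_table", SomeColumns("message", "ts"))

ssc.start()
ssc.awaitTermination()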

Extract the time stamp from kafka messages in spark streaming?

Submitted by 天涯浪子 on 2019-12-07 19:01:25

Question: I am trying to read from a Kafka source and want to extract the timestamp from each message received, in order to do structured Spark streaming. Kafka version 0.10.0.0, Spark Streaming version 2.0.1.

Answer 1: I'd suggest a couple of things. Suppose you create the stream via the latest Kafka streaming API (Kafka 0.10), e.g. with the dependency:

"org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.0.1"

Then you create a stream, according to the docs above:

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker1:9092…
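Continuing in that direction (the answer is cut off above): with the 0-10 integration each element of the stream is a Kafka ConsumerRecord, which carries the record timestamp directly. A sketch; ssc, topics and kafkaParams are assumed to be set up as in the answer.

import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](topics, kafkaParams))

// ConsumerRecord exposes the broker/producer timestamp alongside key and value.
val withTimestamps = stream.map { record: ConsumerRecord[String, String] =>
  (record.timestamp(), record.value())
}

withTimestamps.print()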

Spark Structured Streaming - Read file from Nested Directories

Submitted by 陌路散爱 on 2019-12-07 18:38:42

Question: I have a client which places CSV files in nested directories as below, and I need to read these files in real time. I am trying to do this using Spark Structured Streaming.

Data:
/user/data/1.csv
/user/data/2.csv
/user/data/3.csv
/user/data/sub1/1_1.csv
/user/data/sub1/1_2.csv
/user/data/sub1/sub2/2_1.csv
/user/data/sub1/sub2/2_2.csv

Code:
val csvDF = spark
  .readStream
  .option("sep", ",")
  .schema(userSchema)   // Schema of the csv files
  .csv("/user/data/")

Any configurations to be added to…
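Two approaches are commonly used here, hedged because availability depends on the Spark version: on Spark 3.0+ the file sources accept a recursiveFileLookup option, while on older versions the usual workaround is one stream per directory depth via glob paths, unioned together. A sketch:

// Option 1 (Spark 3.0+): let the file source walk the directory tree itself.
val csvDF = spark
  .readStream
  .option("sep", ",")
  .option("recursiveFileLookup", "true")   // not available on older Spark versions
  .schema(userSchema)
  .csv("/user/data/")

// Option 2 (older versions): one glob per depth, then union the streaming DataFrames.
val depth0 = spark.readStream.option("sep", ",").schema(userSchema).csv("/user/data/*.csv")
val depth1 = spark.readStream.option("sep", ",").schema(userSchema).csv("/user/data/*/*.csv")
val depth2 = spark.readStream.option("sep", ",").schema(userSchema).csv("/user/data/*/*/*.csv")
val allCsv = depth0.union(depth1).union(depth2)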

Spark Streaming - Calculating stats from key-value pairs grouped by keys

Submitted by 痞子三分冷 on 2019-12-07 16:47:05

Question: Background: I'm using Spark Streaming to stream events from Kafka, which arrive as comma-separated key-value pairs. Here is an example of how events are streamed into my Spark application:

Key1=Value1, Key2=Value2, Key3=Value3, Key4=Value4, responseTime=200
Key1=Value5, Key2=Value6, Key3=Value7, Key4=Value8, responseTime=150
Key1=Value9, Key2=Value10, Key3=Value11, Key4=Value12, responseTime=100

Output: I want to calculate different metrics (avg, count etc.) grouped by different keys in…
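One straightforward way to get per-key, per-batch statistics from events in that shape is to parse each line into a map and reduce by the grouping key. A sketch that counts and averages responseTime per Key1 value; the field names come from the sample above, and messages stands for the raw DStream[String] from Kafka.

// Parse "K=V, K=V, ..." lines into (groupingKey, responseTime) pairs.
val parsed = messages.map { line =>
  val fields = line.split(",").map(_.trim.split("=", 2)).map(a => a(0) -> a(1)).toMap
  (fields("Key1"), fields("responseTime").toDouble)
}

// Per-batch count and average of responseTime for each Key1 value.
val statsByKey = parsed
  .mapValues(rt => (rt, 1L))
  .reduceByKey { case ((sum1, n1), (sum2, n2)) => (sum1 + sum2, n1 + n2) }
  .map { case (key, (sum, n)) => (key, n, sum / n) }

statsByKey.print()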

Spark Structured Streaming File Source Starting Offset

Submitted by ╄→гoц情女王★ on 2019-12-07 16:15:24

Question: Is there a way to specify a starting offset for the Spark Structured Streaming file source? I am trying to stream parquet files from HDFS:

spark.sql("SET spark.sql.streaming.schemaInference=true")

spark.readStream
  .parquet("/tmp/streaming/")
  .writeStream
  .option("checkpointLocation", "/tmp/streaming-test/checkpoint")
  .format("parquet")
  .option("path", "/tmp/parquet-sink")
  .trigger(Trigger.ProcessingTime(1.minutes))
  .start()

As I can see, the first run processes all available files detected in the path,…
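The file source has no offset option comparable to Kafka's startingOffsets, but a couple of documented options shape what the first batches pick up, and the checkpoint location makes later runs resume where the previous one stopped. A sketch of those options; schema is a placeholder for the parquet schema.

val input = spark.readStream
  .schema(schema)                            // or rely on spark.sql.streaming.schemaInference
  .option("maxFilesPerTrigger", "100")       // cap how many files each micro-batch reads
  .option("latestFirst", "true")             // process the newest files first
  .parquet("/tmp/streaming/")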