build.sbt: Below are the contents of the build.sbt file.
val sparkVersion = "1.6.3"
scalaVersion := "2.10.5"
resolvers += "Spark Packages Repo" at "https://dl.bintray.com/spark-packages/maven"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-streaming" % sparkVersion,
"org.apache.spark" %% "spark-streaming-kafka" % sparkVersion)
libraryDependencies +="datastax" % "spark-cassandra-connector" % "1.6.3-s_2.10"
libraryDependencies +="org.apache.spark" %% "spark-sql" % "1.1.0"
Command to initialize the shell: The below command is the shell initialization procedure I followed.
/usr/hdp/2.6.0.3-8/spark/bin/spark-shell --packages datastax:spark-cassandra-connector:1.6.3-s_2.10 --conf spark.cassandra.connection.host=127.0.0.1 --jars spark-streaming-kafka-assembly_2.10-1.6.3.jar
Note: Here I specified the jar explicitly because SBT couldn't fetch the required spark-streaming-kafka libraries used to create kafkaStream in later sections.
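As a possible alternative (a sketch I did not verify; the artifact coordinates below are an assumption for the Spark 1.6.3 / Scala 2.10 line), the assembly artifact could be declared in build.sbt so the jar does not have to be passed by hand with --jars:
// Assumed alternative to --jars: pull the Kafka assembly artifact via SBT.
// Coordinates are an assumption and would need to resolve from the configured repositories.
libraryDependencies += "org.apache.spark" % "spark-streaming-kafka-assembly_2.10" % sparkVersion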
Import required libraries:
These are the imports used at various points of the REPL session.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.kafka.KafkaUtils
import com.datastax.spark.connector._
import org.apache.spark.sql.cassandra._
Setting up Spark Streaming Configuration:
Here I configure the settings required for the Spark Streaming context.
val conf = new SparkConf().setMaster("local[*]").setAppName("KafkaReceiver")
conf.set("spark.driver.allowMultipleContexts", "true"); // Required to set this to true because during // shell initialization or starting we a spark context is created with configurations of highlighted
conf.setMaster("local"); // then we are assigning those cofigurations locally
Creation of the Spark Streaming context using the above configuration: Using the configuration defined above, we create a Spark Streaming context in the below way.
val ssc = new StreamingContext(conf, Seconds(1)); // Seconds(1) is the batch interval at which data is fetched
Creating a Kafka stream using the above Spark Streaming Context (ssc): Here ssc is the streaming context created above, "localhost:2181" is the ZooKeeper quorum, "spark-streaming-consumer-group" is the consumer group, and Map("test3" -> 5) is Map("topic" -> number of partitions).
val kafkaStream = KafkaUtils.createStream(ssc, "localhost:2181","spark-streaming-consumer-group", Map("test3" -> 5)).map(_._2)
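For reference, createStream returns (key, message) pairs and the .map(_._2) above keeps only the message payload; below is a small sketch (the rawStream name is made up) that prints both parts of a few records:
// Hypothetical check: inspect a few raw (key, message) pairs before dropping the keys.
val rawStream = KafkaUtils.createStream(ssc, "localhost:2181", "spark-streaming-consumer-group", Map("test3" -> 5))
rawStream.foreachRDD(rdd => rdd.take(3).foreach { case (key, msg) => println(s"key=$key msg=$msg") })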
Note
The values fetched when the kafkaStream object is printed using kafkaStream.print() are shown below:
85052,19,960.00,0,2017-08-29 14:52:41,17,VISHAL_GWY01_HT1,26,VISHAL_GTWY17_PRES_01,1,2,4
85053,19,167.00,0,2017-08-29 14:52:41,17,VISHAL_GWY01_HT1,25,VISHAL_GTWY1_Temp_01,1,2,4
85054,19,960.00,0,2017-08-29 14:52:41,17,VISHAL_GWY01_HT1,26,VISHAL_GTWY17_PRES_01,1,2,4
85055,19,167.00,0,2017-08-29 14:52:54,17,VISHAL_GWY01_HT1,25,VISHAL_GTWY1_Temp_01,1,2,4
85056,19,960.00,0,2017-08-29 14:52:54,17,VISHAL_GWY01_HT1,26,VISHAL_GTWY17_PRES_01,1,2,4
85057,19,167.00,0,2017-08-29 14:52:55,17,VISHAL_GWY01_HT1,25,VISHAL_GTWY1_Temp_01,1,2,4
85058,19,960.00,0,2017-08-29 14:52:55,17,VISHAL_GWY01_HT1,26,VISHAL_GTWY17_PRES_01,1,2,4
17/09/02 18:25:25 INFO JobScheduler: Finished job streaming job 1504376716000 ms.0 from job set of time 1504376716000 ms
17/09/02 18:25:25 INFO JobScheduler: Total delay: 9.661 s for time 1504376716000 ms (execution: 0.021 s)
17/09/02 18:25:25 INFO JobScheduler: Starting job streaming job 1504376717000 ms.0 from job set of time 1504376717000 ms
Transforming the kafkaStream and saving in Cassandra:
kafkaStream.foreachRDD( rdd => {
  if (!rdd.isEmpty()) {
    rdd.map( line => {
      val arr = line.split(",")
      (arr(0), arr(1), arr(2), arr(3), arr(4), arr(5), arr(6), arr(7), arr(8), arr(9), arr(10), arr(11))
    }).saveToCassandra("test", "sensorfeedVals", SomeColumns(
      "tableid", "ccid", "paramval", "batVal", "time", "gwid", "gwhName", "snid", "snhName", "snStatus", "sd", "MId"))
  } else {
    println("No records to save")
  }
})
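An alternative way to express the same mapping (a sketch, not the code I ran; the case class and the lower-cased table and column names are assumptions that would have to match the actual Cassandra schema) is to parse each line into a case class and let the connector map its fields to columns:
// Hypothetical row type; field names are assumed to match lower-case Cassandra column names.
case class SensorFeed(tableid: String, ccid: String, paramval: String, batval: String,
                      time: String, gwid: String, gwhname: String, snid: String,
                      snhname: String, snstatus: String, sd: String, mid: String)

kafkaStream.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    rdd.map { line =>
      val a = line.split(",")
      SensorFeed(a(0), a(1), a(2), a(3), a(4), a(5), a(6), a(7), a(8), a(9), a(10), a(11))
    }.saveToCassandra("test", "sensorfeedvals") // columns are taken from the case class fields
  }
}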
Start ssc:
The streaming is started with ssc.start.
Issues I am facing here are:
1. Printing of the stream's content happens only after I enter exit or press Ctrl+C.
2. Whenever I use ssc.start, does it start streaming immediately in the REPL, without giving me time to enter ssc.awaitTermination?
3. The main issue: when I try to save normally with the below procedure
val collection = sc.parallelize(Seq(("key3", 3), ("key4", 4)))
collection.saveToCassandra("test", "kv", SomeColumns("key", "value"))
I am able to save to Cassandra, but whenever I try to save using the logic shown in "Transforming the kafkaStream and saving in Cassandra" I couldn't extract each value from the string and save it into the respective columns of the Cassandra table!
java.lang.NoClassDefFoundError: Could not initialize class com.datastax.spark.connector.cql.CassandraConnector
This means the classpath has not been set up correctly for your application. Make sure you are using the --packages option when launching your application, as noted in the SCC docs.
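A quick way to check from the REPL whether the connector classes are on the classpath at all (a small sketch; it only proves the class can be loaded, not that the connection settings are correct):
// Throws ClassNotFoundException / NoClassDefFoundError if the connector jar is not on the classpath.
Class.forName("com.datastax.spark.connector.cql.CassandraConnector")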
For your other issues:
You don't need awaitTermination in the REPL because the REPL will not instantly quit after starting the streaming context. That call is there for an application which may have no further instructions, to prevent the main thread from exiting.
start will start the streaming immediately.
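To make the distinction concrete, here is roughly where awaitTermination belongs in a packaged application rather than the REPL (a sketch; the object name and app name are made up):
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical standalone driver: without awaitTermination the main thread
// would reach the end of main() right after start() and the JVM could exit.
object KafkaToCassandraApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("KafkaToCassandraApp")
    val ssc = new StreamingContext(conf, Seconds(1))
    // ... define the Kafka stream and the save logic here ...
    ssc.start()              // starts the streaming computation immediately
    ssc.awaitTermination()   // blocks the main thread until the job is stopped
  }
}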
A line or two of code related to contexts was causing the issue here!
I found the solution when I walked through the topic of contexts.
Here I was running multiple contexts, but they were independent of each other.
I had initialized the shell with the below command:
/usr/hdp/2.6.0.3-8/spark/bin/spark-shell --packages datastax:spark-cassandra-connector:1.6.3-s_2.10 --conf spark.cassandra.connection.host=127.0.0.1 --jars spark-streaming-kafka-assembly_2.10-1.6.3.jar
So when the shell starts, a Spark context with the properties of the DataStax connector is initialized.
Later I created some configurations and, using those configurations, created a Spark Streaming context. Using this context I created kafkaStream. This kafkaStream only has the properties of the SSC, not of the SC, and that is where the issue of storing into Cassandra arose.
I resolved it as shown below and succeeded!
val sc = new SparkContext(new SparkConf().setAppName("Spark-Kafka-Streaming").setMaster("local[*]").set("spark.cassandra.connection.host", "127.0.0.1"))
val ssc = new StreamingContext(sc, Seconds(10))
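As a quick sanity check (a sketch), the Cassandra host setting can be read back from the streaming context's underlying SparkContext to confirm that the single shared context carries both sets of properties:
// Should print 127.0.0.1 if the connector host was picked up by the shared SparkContext.
println(ssc.sparkContext.getConf.get("spark.cassandra.connection.host"))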
Thanks to everyone who came forward to support! Let me know of any better ways to achieve this!
A very simple approach is to convert the stream to a DataFrame inside the foreachRDD API: convert each RDD to a DataFrame and save it to Cassandra using the Spark SQL Cassandra data source API. Below is a simple code snippet where I save Twitter tweets to a Cassandra table.
stream.foreachRDD(rdd => {
if (rdd.count() > 0) {
val data = rdd.filter(status => status.getLang.equals("en")).map(status => TweetsClass(status.getId,
status.getCreatedAt.toGMTString(),
status.getUser.getLocation,
status.getText)).toDF()
//Save the data to Cassandra
data.write.
format("org.apache.spark.sql.cassandra").
options(Map(
"table" -> "sentiment_tweets",
"keyspace" -> "My Keyspace",
"cluster" -> "My Cluster")).mode(SaveMode.Append).save()
}
})
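For the snippet above to compile, a few imports and a case class are assumed but not shown; a minimal sketch of what they might look like (the TweetsClass fields and the sqlContext name are assumptions based on the calls used above):
import org.apache.spark.sql.{SQLContext, SaveMode}

// Assumed case class backing toDF(); field types are guesses from the getters used above.
case class TweetsClass(id: Long, createdAt: String, location: String, text: String)

val sqlContext = new SQLContext(sc)  // or reuse the REPL's existing sqlContext
import sqlContext.implicits._        // brings rdd.toDF() into scope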
Source: https://stackoverflow.com/questions/46016586/how-to-save-spark-streaming-data-in-cassandra