build.sbt: Below are the contents of the build.sbt file.
val sparkVersion = "1.6.3"
scalaVersion := "2.10.5"
resolvers += "Spark Packages Repo" at "https://dl.bintray.com/spark-packages/maven"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-streaming" % sparkVersion,
"org.apache.spark" %% "spark-streaming-kafka" % sparkVersion)
libraryDependencies +="datastax" % "spark-cassandra-connector" % "1.6.3-s_2.10"
libraryDependencies +="org.apache.spark" %% "spark-sql" % "1.1.0"
Command to initialize the shell: The below command is the shell initialization procedure I followed.
/usr/hdp/2.6.0.3-8/spark/bin/spark-shell --packages datastax:spark-cassandra-connector:1.6.3-s_2.10 --conf spark.cassandra.connection.host=127.0.0.1 --jars spark-streaming-kafka-assembly_2.10-1.6.3.jar
Note: Here I specified the jar explicitly because SBT couldn't fetch the required spark-streaming-kafka libraries used to create kafkaStream in later sections.
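As a possible alternative (a sketch I did not verify; the artifact coordinates below are an assumption for the Spark 1.6.3 / Scala 2.10 line), the assembly artifact could be declared in build.sbt so the jar does not have to be passed by hand with --jars:
// Assumed alternative to --jars: pull the Kafka assembly artifact via SBT.
// Coordinates are an assumption and would need to resolve from the configured repositories.
libraryDependencies += "org.apache.spark" % "spark-streaming-kafka-assembly_2.10" % sparkVersion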
Import required libraries:
These are the imports used at various points of the REPL session.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.kafka.KafkaUtils
import com.datastax.spark.connector._
import org.apache.spark.sql.cassandra._
Setting up Spark Streaming Configuration:
Here I configure the settings required for the Spark Streaming context.
val conf = new SparkConf().setMaster("local[*]").setAppName("KafkaReceiver")
conf.set("spark.driver.allowMultipleContexts", "true"); // Required to set this to true because during // shell initialization or starting we a spark context is created with configurations of highlighted
conf.setMaster("local"); // then we are assigning those cofigurations locally
Creation of the Spark Streaming context using the above configuration: Using the configuration defined above, we create a Spark Streaming context in the below way.
val ssc = new StreamingContext(conf, Seconds(1)); // Seconds(1) is the batch interval at which data is fetched
Creating a Kafka stream using the above Spark Streaming Context (ssc): Here ssc is the streaming context created above, "localhost:2181" is the ZooKeeper quorum, "spark-streaming-consumer-group" is the consumer group, and Map("test3" -> 5) is Map("topic" -> number of partitions).
val kafkaStream = KafkaUtils.createStream(ssc, "localhost:2181","spark-streaming-consumer-group", Map("test3" -> 5)).map(_._2)
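For reference, createStream returns (key, message) pairs and the .map(_._2) above keeps only the message payload; below is a small sketch (the rawStream name is made up) that prints both parts of a few records:
// Hypothetical check: inspect a few raw (key, message) pairs before dropping the keys.
val rawStream = KafkaUtils.createStream(ssc, "localhost:2181", "spark-streaming-consumer-group", Map("test3" -> 5))
rawStream.foreachRDD(rdd => rdd.take(3).foreach { case (key, msg) => println(s"key=$key msg=$msg") })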
Note
The values fetched when the kafkaStream object is printed using kafkaStream.print() are shown below:
85052,19,960.00,0,2017-08-29 14:52:41,17,VISHAL_GWY01_HT1,26,VISHAL_GTWY17_PRES_01,1,2,4
85053,19,167.00,0,2017-08-29 14:52:41,17,VISHAL_GWY01_HT1,25,VISHAL_GTWY1_Temp_01,1,2,4
85054,19,960.00,0,2017-08-29 14:52:41,17,VISHAL_GWY01_HT1,26,VISHAL_GTWY17_PRES_01,1,2,4
85055,19,167.00,0,2017-08-29 14:52:54,17,VISHAL_GWY01_HT1,25,VISHAL_GTWY1_Temp_01,1,2,4
85056,19,960.00,0,2017-08-29 14:52:54,17,VISHAL_GWY01_HT1,26,VISHAL_GTWY17_PRES_01,1,2,4
85057,19,167.00,0,2017-08-29 14:52:55,17,VISHAL_GWY01_HT1,25,VISHAL_GTWY1_Temp_01,1,2,4
85058,19,960.00,0,2017-08-29 14:52:55,17,VISHAL_GWY01_HT1,26,VISHAL_GTWY17_PRES_01,1,2,4
17/09/02 18:25:25 INFO JobScheduler: Finished job streaming job 1504376716000 ms.0 from job set of time 1504376716000 ms
17/09/02 18:25:25 INFO JobScheduler: Total delay: 9.661 s for time 1504376716000 ms (execution: 0.021 s)
17/09/02 18:25:25 INFO JobScheduler: Starting job streaming job 1504376717000 ms.0 from job set of time 1504376717000 ms
Transforming the kafkaStream and saving in Cassandra:
kafkaStream.foreachRDD( rdd => {
  if (!rdd.isEmpty()) {
    rdd.map( line => {
      val arr = line.split(",")
      (arr(0), arr(1), arr(2), arr(3), arr(4), arr(5), arr(6), arr(7), arr(8), arr(9), arr(10), arr(11))
    }).saveToCassandra("test", "sensorfeedVals", SomeColumns(
      "tableid", "ccid", "paramval", "batVal", "time", "gwid", "gwhName", "snid", "snhName", "snStatus", "sd", "MId"))
  } else {
    println("No records to save")
  }
})
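An alternative way to express the same mapping (a sketch, not the code I ran; the case class and the lower-cased table and column names are assumptions that would have to match the actual Cassandra schema) is to parse each line into a case class and let the connector map its fields to columns:
// Hypothetical row type; field names are assumed to match lower-case Cassandra column names.
case class SensorFeed(tableid: String, ccid: String, paramval: String, batval: String,
                      time: String, gwid: String, gwhname: String, snid: String,
                      snhname: String, snstatus: String, sd: String, mid: String)

kafkaStream.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    rdd.map { line =>
      val a = line.split(",")
      SensorFeed(a(0), a(1), a(2), a(3), a(4), a(5), a(6), a(7), a(8), a(9), a(10), a(11))
    }.saveToCassandra("test", "sensorfeedvals") // columns are taken from the case class fields
  }
}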
Start ssc:
The streaming is started with ssc.start.
Issues I am facing here are:
1. Printing of the stream's content happens only after I enter exit or press Ctrl+C.
2. Whenever I use ssc.start, does it start streaming immediately in the REPL, without giving me time to enter ssc.awaitTermination?
3. The main issue: when I try to save normally with the below procedure
val collection = sc.parallelize(Seq(("key3", 3), ("key4", 4)))
collection.saveToCassandra("test", "kv", SomeColumns("key", "value"))
I am able to save to Cassandra, but whenever I try to save using the logic shown in "Transforming the kafkaStream and saving in Cassandra" I couldn't extract each value from the string and save it into the respective columns of the Cassandra table!
java.lang.NoClassDefFoundError: Could not initialize class com.datastax.spark.connector.cql.CassandraConnector
This means the classpath has not been set up correctly for your application. Make sure you are using the --packages option when launching your application, as noted in the SCC docs.
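A quick way to check from the REPL whether the connector classes are on the classpath at all (a small sketch; it only proves the class can be loaded, not that the connection settings are correct):
// Throws ClassNotFoundException / NoClassDefFoundError if the connector jar is not on the classpath.
Class.forName("com.datastax.spark.connector.cql.CassandraConnector")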
For your other issues:
You don't need awaitTermination in the REPL because the REPL will not instantly quit after starting the streaming context. That call is there for an application which may have no further instructions, to prevent the main thread from exiting.
start will start the streaming immediately.
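To make the distinction concrete, here is roughly where awaitTermination belongs in a packaged application rather than the REPL (a sketch; the object name and app name are made up):
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical standalone driver: without awaitTermination the main thread
// would reach the end of main() right after start() and the JVM could exit.
object KafkaToCassandraApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("KafkaToCassandraApp")
    val ssc = new StreamingContext(conf, Seconds(1))
    // ... define the Kafka stream and the save logic here ...
    ssc.start()              // starts the streaming computation immediately
    ssc.awaitTermination()   // blocks the main thread until the job is stopped
  }
}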
A line or two of code related to contexts was causing the issue here!
I found the solution when I walked through the topic of contexts.
Here I was running multiple contexts, but they were independent of each other.
I had initialized the shell with the below command:
/usr/hdp/2.6.0.3-8/spark/bin/spark-shell --packages datastax:spark-cassandra-connector:1.6.3-s_2.10 --conf spark.cassandra.connection.host=127.0.0.1 --jars spark-streaming-kafka-assembly_2.10-1.6.3.jar
So when the shell starts, a Spark context with the properties of the DataStax connector is initialized.
Later I created some configurations and, using those configurations, created a Spark Streaming context. Using this context I created kafkaStream. This kafkaStream only has the properties of the SSC, not of the SC, and that is where the issue of storing into Cassandra arose.
I resolved it as shown below and succeeded!
val sc = new SparkContext(new SparkConf().setAppName("Spark-Kafka-Streaming").setMaster("local[*]").set("spark.cassandra.connection.host", "127.0.0.1"))
val ssc = new StreamingContext(sc, Seconds(10))
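As a quick sanity check (a sketch), the Cassandra host setting can be read back from the streaming context's underlying SparkContext to confirm that the single shared context carries both sets of properties:
// Should print 127.0.0.1 if the connector host was picked up by the shared SparkContext.
println(ssc.sparkContext.getConf.get("spark.cassandra.connection.host"))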
Thanks to everyone who came forward to support! Let me know of any better ways to achieve this!
A very simple approach is to convert the stream to a DataFrame inside the foreachRDD API: convert each RDD to a DataFrame and save it to Cassandra using the Spark SQL Cassandra data source API. Below is a simple code snippet where I save Twitter tweets to a Cassandra table.
stream.foreachRDD(rdd => {
if (rdd.count() > 0) {
val data = rdd.filter(status => status.getLang.equals("en")).map(status => TweetsClass(status.getId,
status.getCreatedAt.toGMTString(),
status.getUser.getLocation,
status.getText)).toDF()
//Save the data to Cassandra
data.write.
format("org.apache.spark.sql.cassandra").
options(Map(
"table" -> "sentiment_tweets",
"keyspace" -> "My Keyspace",
"cluster" -> "My Cluster")).mode(SaveMode.Append).save()
}
})
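For the snippet above to compile, a few imports and a case class are assumed but not shown; a minimal sketch of what they might look like (the TweetsClass fields and the sqlContext name are assumptions based on the calls used above):
import org.apache.spark.sql.{SQLContext, SaveMode}

// Assumed case class backing toDF(); field types are guesses from the getters used above.
case class TweetsClass(id: Long, createdAt: String, location: String, text: String)

val sqlContext = new SQLContext(sc)  // or reuse the REPL's existing sqlContext
import sqlContext.implicits._        // brings rdd.toDF() into scope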
Source: https://stackoverflow.com/questions/46016586/how-to-save-spark-streaming-data-in-cassandra