Spark with Cassandra input/output


The easiest way to work with Cassandra and Spark is to use the official open source Cassandra driver for Spark developed by DataStax: https://github.com/datastax/spark-cassandra-connector

This driver is built on top of the Cassandra Java Driver and provides a direct bridge between Cassandra and Spark. Unlike Calliope, it does not use the Hadoop interface. Additionally, it offers the following unique features (a short usage sketch follows the list):

  • support for all Cassandra data types, including collections, out of the box
  • lightweight mapping of Cassandra rows to custom classes or tuples without the need to use any implicits or other advanced features in Scala
  • saving any RDDs to Cassandra
  • full support for Cassandra Virtual Nodes
  • ability to filter / select on the server side, e.g. leveraging Cassandra clustering columns or secondary indexes

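For illustration, here is a minimal sketch of reading and writing with the connector. The keyspace test, the table words, and its columns are hypothetical, and spark.cassandra.connection.host is assumed to be set on the SparkConf:

import com.datastax.spark.connector._   // adds cassandraTable / saveToCassandra

// Read a table into an RDD of CassandraRow; select and where are pushed
// down to Cassandra (where requires a clustering column or secondary index)
val words = sparkContext.cassandraTable("test", "words")
  .select("word", "count")
  .where("count > ?", 10)

// Save any RDD of tuples (or case classes) back to Cassandra
val counts = sparkContext.parallelize(Seq(("cat", 30), ("dog", 40)))
counts.saveToCassandra("test", "words", SomeColumns("word", "count"))
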
The approach in the code above is a classic centralized algorithm that would work only if executed on a single node. Both Cassandra and Spark are distributed systems, so the process must be modeled in such a way that it can be distributed among a number of nodes.

There are a few possible approaches. If you know the keys of the rows to fetch, you can do something simple like this (using the DataStax Java Driver):

import com.datastax.driver.core.{BoundStatement, Cluster}

val data = sparkContext.parallelize(keys).map { key =>
   // Connect inside the closure so the connection is created on the worker
   val cluster = Cluster.builder.addContactPoint(host).build()
   val session = cluster.connect(keyspace)
   val statement = session.prepare("...cql...")
   val boundStatement = new BoundStatement(statement)
   session.execute(boundStatement.bind(key))   // bind the key as the query parameter
}

This will effectively distribute the fetching of keys across the Spark cluster. Note how the connection to C* is made within the closure: this ensures the connection is established when the task is executed on each separate, distributed worker.
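
If opening a connection per key turns out to be too expensive, a common refinement is to open one connection per partition with mapPartitions. This is a sketch under the same assumed host and keyspace variables, reading a hypothetical first text column:

val data = sparkContext.parallelize(keys).mapPartitions { partition =>
   // One connection per partition instead of per key amortizes the setup cost
   val cluster = Cluster.builder.addContactPoint(host).build()
   val session = cluster.connect(keyspace)
   val statement = session.prepare("...cql...")
   // Extract plain values so the results remain serializable after closing
   val results = partition.map { key =>
     session.execute(statement.bind(key)).one().getString(0)
   }.toList
   cluster.close()
   results.iterator
}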

Given that your example uses a wildcard (i.e. the keys are not known), using the Hadoop interface of Cassandra is a good option. The Spark-Cassandra example linked in the question illustrates the use of this interface.
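
For reference, this is roughly what a wildcard read through that Hadoop interface looks like with the CQL3 input format shipped in the Cassandra 1.2/2.0 era. host, keyspace and table are assumed variables; the Thrift port and page size are illustrative:

import java.nio.ByteBuffer
import org.apache.hadoop.mapreduce.Job
import org.apache.cassandra.hadoop.ConfigHelper
import org.apache.cassandra.hadoop.cql3.{CqlConfigHelper, CqlPagingInputFormat}

val job = Job.getInstance()
val conf = job.getConfiguration
ConfigHelper.setInputInitialAddress(conf, host)
ConfigHelper.setInputRpcPort(conf, "9160")               // Thrift port
ConfigHelper.setInputColumnFamily(conf, keyspace, table)
ConfigHelper.setInputPartitioner(conf, "Murmur3Partitioner")
CqlConfigHelper.setInputCQLPageRowSize(conf, "1000")     // rows per page

// Key and value are both maps of column name to raw bytes
val casRdd = sparkContext.newAPIHadoopRDD(
  conf,
  classOf[CqlPagingInputFormat],
  classOf[java.util.Map[String, ByteBuffer]],
  classOf[java.util.Map[String, ByteBuffer]]
)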

Calliope is a library that encapsulates the complexity of using the Hadoop interface by providing a simple API to access that functionality. It's available only in Scala, as it relies on Scala-specific features (like implicits, and macros in the upcoming release). With Calliope, you basically declare how to convert your RDD[type] into a row key and row value, and Calliope takes care of configuring the Hadoop interfaces to do the job. We have found that Calliope (and the underlying Hadoop interfaces) is 2-4x faster than using a driver to interact with Cassandra.

Conclusion: I'd walk away from the Spring Data configuration to access Cassandra, as it will limit you to a single node. Consider simple parallelized access if possible, or explore using Calliope in Scala.
