spark-cassandra-connector

Cannot connect to Cassandra from Spark (Contact points contain multiple data centers)

Submitted by 允我心安 on 2020-06-12 07:32:12
Question: I am trying to run my first Spark job (a Scala job that accesses Cassandra), and it fails with the following error:

```
java.io.IOException: Failed to open native connection to Cassandra at {<ip>}:9042
  at com.datastax.spark.connector.cql.CassandraConnector$.com$datastax$spark$connector$cql$CassandraConnector$$createSession(CassandraConnector.scala:164)
  at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$2.apply(CassandraConnector.scala:150)
  at com.datastax.spark.connector
```
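The title points at the usual cause: the configured contact points span more than one data center. A minimal sketch of the common fix, with placeholder addresses and data center name, is to list only nodes from a single DC, or to pin the local DC by name (the property is spelled `spark.cassandra.connection.localDC` in connector 2.5+/3.x; older releases use `local_dc`):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Placeholder hosts: list only nodes that belong to one data center.
val conf = new SparkConf()
  .setAppName("CassandraJob")
  .set("spark.cassandra.connection.host", "10.0.0.1,10.0.0.2")
  // Connector 2.5+/3.x can also pin the data center explicitly:
  .set("spark.cassandra.connection.localDC", "dc1")

val spark = SparkSession.builder.config(conf).getOrCreate()
```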

java.lang.ClassNotFoundException: com.datastax.spark.connector.rdd.partitioner.CassandraPartition

Submitted by 房东的猫 on 2020-05-17 07:01:25
Question: I've been working with Cassandra for a little while and now I'm trying to set up Spark and spark-cassandra-connector. I'm using IntelliJ IDEA to do that (my first time with IntelliJ IDEA and Scala, too) on Windows 10. build.gradle:

```
apply plugin: 'scala'
apply plugin: 'idea'
apply plugin: 'eclipse'

repositories {
    mavenCentral()
    flatDir { dirs 'runtime libs' }
}

idea {
    project {
        jdkName = '1.8'
        languageLevel = '1.8'
    }
}

dependencies {
    compile group: 'org.apache.spark', name: 'spark-core_2.11',
```
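A ClassNotFoundException for com.datastax.spark.connector.rdd.partitioner.CassandraPartition at run time usually means the connector jar is on the driver's classpath but was never shipped to the executors. A hedged sketch of one fix (the assembly path and host below are placeholders) is to distribute the jar explicitly:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("CassandraSetup")
  .set("spark.cassandra.connection.host", "127.0.0.1") // placeholder host

val sc = new SparkContext(conf)

// Ship the connector assembly to every executor so its classes
// (including CassandraPartition) resolve there as well as on the driver.
sc.addJar("/path/to/spark-cassandra-connector-assembly_2.11.jar") // hypothetical path
```

Submitting with `--packages com.datastax.spark:spark-cassandra-connector_2.11:<version>` achieves the same thing by letting Spark resolve and distribute the dependency.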

Spark Cassandra Connector Java API: append/remove data in a collection fails

Submitted by 懵懂的女人 on 2020-03-26 03:16:27
Question: I am trying to append values to a column of type set via the Java API. It seems that the connector disregards the CollectionBehavior I am setting and always overwrites the previous collection; even when I use CollectionRemove, the value to be removed is added to the collection. I am following the example shown in: https://datastax-oss.atlassian.net/browse/SPARKC-340?page=com.atlassian.jira.plugin.system.issuetabpanels%3Achangehistory-tabpanel I am using: spark-core_2.11 2.2.0
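For reference, the Scala form of the same collection behaviors (which the Java API wraps) looks like this; a minimal sketch, assuming a table ks.items(key text PRIMARY KEY, tags set<text>):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

val sc = new SparkContext(new SparkConf().setAppName("SetAppend"))

val updates = sc.parallelize(Seq(("k1", Set("new-tag"))))

// "tags" append adds elements to the existing set instead of replacing
// it; "tags" remove would delete the listed elements instead.
updates.saveToCassandra("ks", "items", SomeColumns("key", "tags" append))
```

If the Scala form behaves correctly against the same table, that narrows the problem to how the Java wrapper builds its column selectors.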

Remove duplicates without a shuffle in Spark

Submitted by 柔情痞子 on 2020-03-06 02:22:51
Question: I have a Cassandra table XYX with columns (id uuid, insert timestamp, header text), where id and insert form a composite primary key. I'm using DataFrames, and in my spark shell I fetch the id and header columns. I want distinct rows based on id and header. I'm seeing a lot of shuffles, which should not be necessary, since the Spark Cassandra connector ensures that all rows for a given Cassandra partition end up in the same Spark partition. After fetching, I use dropDuplicates to get distinct records.
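When every duplicate is guaranteed to sit in the same Spark partition (here, id is the partition key), de-duplication can be done per partition without an exchange. A hedged sketch, assuming a hypothetical keyspace name and that both columns read back as strings:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

val df = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "ks", "table" -> "xyx")) // hypothetical keyspace
  .load()
  .select("id", "header")

// mapPartitions de-duplicates inside each partition and introduces no
// shuffle; this is only correct if duplicates never span partitions.
val distinctRows = df.as[(String, String)]
  .mapPartitions(_.toSeq.distinct.iterator)
  .toDF("id", "header")
```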

Is there an alternative to joinWithCassandraTable for DataFrames in Spark (Scala) when retrieving data from only certain Cassandra partitions?

Submitted by 五迷三道 on 2020-01-23 01:13:29
Question: When extracting a small number of partitions from a large C* table using RDDs, we can use this:

```
val rdd = … // rdd including partition data
val data = rdd.repartitionByCassandraReplica(keyspace, tableName)
  .joinWithCassandraTable(keyspace, tableName)
```

Do we have an equally effective approach available using DataFrames?

Update (Apr 26, 2017): To be more concrete, I prepared an example. I have 2 tables in Cassandra:

```
CREATE TABLE ids (
  id text,
  registered timestamp,
  PRIMARY KEY (id)
)

CREATE TABLE
```
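One later option worth noting as a sketch: connector 2.5+/3.x ships a Catalyst extension that can turn a DataFrame join on the partition key into a "direct join" (per-key lookups) rather than a full scan plus shuffle. Keyspace and table names below are assumptions:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .config("spark.sql.extensions",
          "com.datastax.spark.connector.CassandraSparkExtensions")
  .getOrCreate()

val keys = spark.read.format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "ks", "table" -> "ids")).load()

val big = spark.read.format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "ks", "table" -> "big_table")).load()

// When the join keys cover the Cassandra partition key, the extension
// can plan a direct join instead of scanning the whole table.
val joined = keys.join(big, Seq("id"))
```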

NoSuchMethodError from spark-cassandra-connector with assembled jar

Submitted by 五迷三道 on 2020-01-20 07:29:25
Question: I'm fairly new to Scala and am trying to build a Spark job. I've built a job that contains the DataStax connector and assembled it into a fat jar. When I try to execute it, it fails with a java.lang.NoSuchMethodError. I've cracked open the JAR and can see that the DataStax library is included. Am I missing something obvious? Is there a good tutorial to look at regarding this process? Thanks.

Console:

```
$ spark-submit --class org.bobbrez.CasCountJob ./target/scala-2.11/bobbrez-spark-assembly-0.0.1
```
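A NoSuchMethodError from an assembly that visibly contains the library usually means a version clash: the fat jar bundles classes that also exist, at a different version, on Spark's own classpath. A hedged build.sbt sketch (versions illustrative) that keeps Spark itself out of the assembly:

```scala
// build.sbt: mark Spark as "provided" so only the connector and its
// dependencies end up inside the assembly jar.
name := "bobbrez-spark"
scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.6.1" % "provided",
  "com.datastax.spark" %% "spark-cassandra-connector" % "1.6.0"
)
```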

Installing the Cassandra Spark connector

Submitted by ♀尐吖头ヾ on 2020-01-16 03:04:17
Question: As per https://github.com/datastax/spark-cassandra-connector and http://spark-packages.org/package/datastax/spark-cassandra-connector I ran the command, but at the end it looks like there are errors. Are these fatal, or do I need to resolve them?

```
[idf@node1 bin]$ spark-shell --packages datastax:spark-cassandra-connector:1.6.0-M1-s_2.11
Ivy Default Cache set to: /home/idf/.ivy2/cache
The jars for the packages stored in: /home/idf/.ivy2/jars
:: loading settings :: url = jar:file:/opt/spark-1.6.1-bin
```
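Independent of what the Ivy log prints, a quick smoke test inside the resulting spark-shell settles whether the package resolved. A sketch, assuming a reachable cluster with a keyspace test and table kv:

```scala
import com.datastax.spark.connector._

// If the connector classes loaded, the import succeeds and this scan
// returns a row count instead of a ClassNotFoundException.
val rdd = sc.cassandraTable("test", "kv")
println(rdd.count())
```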

Cassandra/Spark showing incorrect entries count for large table

Submitted by 风流意气都作罢 on 2020-01-04 09:23:31
Question: I am trying to use Spark to process a large Cassandra table (~402 million entries and 84 columns), but I am getting inconsistent results. Initially the requirement was to copy some columns from this table to another table. After copying the data, I noticed that some entries in the new table were missing. To verify, I took a count of the large source table, but I get different values each time. I tried the queries on a smaller table (~7 million records) and the results were fine.
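Two usual suspects for drifting counts at this scale are the read consistency level (the connector defaults to LOCAL_ONE, so different replicas may answer each run) and reads timing out mid-scan. A hedged sketch of tightening both, with property names from the connector reference and placeholder keyspace/table names:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

val conf = new SparkConf()
  .setAppName("CountCheck")
  .set("spark.cassandra.input.consistency.level", "LOCAL_QUORUM")
  .set("spark.cassandra.read.timeout_ms", "120000")

val sc = new SparkContext(conf)

// cassandraCount() pushes the counting down to Cassandra instead of
// materializing 402M rows in Spark.
println(sc.cassandraTable("ks", "big_table").cassandraCount())
```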

How to call DataFrameFunctions.createCassandraTable from Java?

Submitted by 徘徊边缘 on 2019-12-25 08:58:44
Question: How can I call this function from Java? Or do I need a wrapper in Scala?

```
package com.datastax.spark.connector

class DataFrameFunctions(dataFrame: DataFrame) extends Serializable {
  ...
  def createCassandraTable(
      keyspaceName: String,
      tableName: String,
      partitionKeyColumns: Option[Seq[String]] = None,
      clusteringKeyColumns: Option[Seq[String]] = None)(
      implicit connector: CassandraConnector = CassandraConnector(sparkContext.getConf)): Unit = {
  ...
```

Answer 1: I used the following code:
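The answer's code is not preserved in this excerpt. As a separate hedged sketch (not the original answer): since the method takes Option defaults and an implicit, one approach is a small Scala bridge with a Java-friendly signature; everything below except createCassandraTable itself is illustrative:

```scala
package com.example.bridge // hypothetical package

import scala.collection.JavaConverters._
import com.datastax.spark.connector._
import com.datastax.spark.connector.cql.CassandraConnector
import org.apache.spark.sql.DataFrame

object CreateTableBridge {
  // No Option parameters, no implicits: callable directly from Java.
  def createCassandraTable(df: DataFrame,
                           keyspace: String,
                           table: String,
                           partitionKeys: java.util.List[String],
                           clusteringKeys: java.util.List[String]): Unit = {
    df.createCassandraTable(
      keyspace,
      table,
      partitionKeyColumns = Some(partitionKeys.asScala.toSeq),
      clusteringKeyColumns = Some(clusteringKeys.asScala.toSeq))(
      CassandraConnector(df.sqlContext.sparkContext.getConf))
  }
}
```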

How to handle Spark with multiple Cassandra servers with different SSL policies

Submitted by 帅比萌擦擦* on 2019-12-25 01:49:27
Question: One Cassandra cluster doesn't have SSL enabled and another Cassandra cluster has SSL enabled. How do I interact with both clusters from a single Spark job? I have to copy a table from one server (without SSL) into another server (with SSL).

Spark job:

```
object TwoClusterExample extends App {
  val conf = new SparkConf(true).setAppName("SparkCassandraTwoClusterExample")
  println("Starting the SparkCassandraLocalJob....")

  val sc = new SparkContext(conf)

  val connectorToClusterOne
```
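The connector's read and write methods both accept an implicit CassandraConnector, so one hedged sketch is to build two connectors from cloned configurations, enabling SSL only on the target; hosts, paths, and the password below are placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._
import com.datastax.spark.connector.cql.CassandraConnector

val conf = new SparkConf(true).setAppName("TwoClusterCopy")
val sc = new SparkContext(conf)

// Plain connection for the source cluster.
val sourceConnector = CassandraConnector(conf.clone()
  .set("spark.cassandra.connection.host", "10.0.0.1"))

// SSL-enabled connection for the target cluster.
val targetConnector = CassandraConnector(conf.clone()
  .set("spark.cassandra.connection.host", "10.0.1.1")
  .set("spark.cassandra.connection.ssl.enabled", "true")
  .set("spark.cassandra.connection.ssl.trustStore.path", "/path/to/truststore")
  .set("spark.cassandra.connection.ssl.trustStore.password", "changeit"))

// Scope each connector with an implicit while reading and writing.
val rows = {
  implicit val c: CassandraConnector = sourceConnector
  sc.cassandraTable("ks", "src_table")
}
{
  implicit val c: CassandraConnector = targetConnector
  rows.saveToCassandra("ks", "dst_table")
}
```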