spark-cassandra-connector

How to use the spark-cassandra-connector API in Scala

Submitted by 爱⌒轻易说出口 on 2019-12-11 15:18:51
Question: My previous post: Repairing Prepared stmt warning. I was not able to solve it; following a few suggestions, I tried using the spark-cassandra-connector to solve my problem. But I am completely confused about how to use it in my application. I tried to write code as below, but I am not sure exactly how to use the APIs:

val conf = new SparkConf(true)
  .set("spark.cassandra.connection.host", "1.1.1.1")
  .set("spark.cassandra.auth.username", "auser")
  .set("spark.cassandra.auth.password", "apass")
  .set("spark.cassandra
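For reference, a minimal sketch of the connector's RDD API, picking up where the snippet above leaves off. The keyspace, table, and column names are placeholders, and a connector version matching your Spark version is assumed:

import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._  // brings cassandraTable/saveToCassandra into scope

val conf = new SparkConf(true)
  .set("spark.cassandra.connection.host", "1.1.1.1")
  .set("spark.cassandra.auth.username", "auser")
  .set("spark.cassandra.auth.password", "apass")

val sc = new SparkContext(conf)

// Read a table as an RDD of CassandraRow and pull one column out of it.
val rdd = sc.cassandraTable("my_keyspace", "my_table")
rdd.map(_.getString("some_column")).take(10).foreach(println)

// Write a local collection back to another table, mapping tuple
// elements to the named columns.
sc.parallelize(Seq((1, "a"), (2, "b")))
  .saveToCassandra("my_keyspace", "other_table", SomeColumns("id", "value"))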

Apache Spark - Cassandra Guava incompatibility

Submitted by ◇◆丶佛笑我妖孽 on 2019-12-11 08:47:04
Question: I am using Apache Spark 2.1.0, the Apache Spark connector 2.0.0-M3, and Cassandra driver core 3.0.0. I get the following error when I try to execute the program:

17/01/19 10:38:27 WARN TaskSetManager: Lost task 1.0 in stage 1.0 (TID 5, 10.10.10.51, executor 1): java.lang.NoClassDefFoundError: Could not initialize class com.datastax.driver.core.Cluster
    at com.datastax.spark.connector.cql.DefaultConnectionFactory$.clusterBuilder(CassandraConnectionFactory.scala:35)
    at com.datastax.spark.connector.cql
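A frequent cause of this particular NoClassDefFoundError is declaring a standalone cassandra-driver-core jar alongside the connector: the standalone driver needs a newer Guava than the one already on Spark's classpath, while the connector ships with (and in this release line shades) a compatible driver of its own. A build.sbt sketch of that fix, with versions taken from the question and everything else assumed:

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.1.0" % "provided",
  "org.apache.spark" %% "spark-sql"  % "2.1.0" % "provided",
  // Rely on the driver the connector brings in; do NOT also declare
  // cassandra-driver-core 3.0.0 explicitly, or its Guava requirement
  // will clash with the Guava version Spark already provides.
  "com.datastax.spark" %% "spark-cassandra-connector" % "2.0.0-M3"
)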

Spark Cassandra NoClassDefFoundError guava/cache/CacheLoader

Submitted by 天涯浪子 on 2019-12-11 07:35:49
Question: Running Cassandra 2.2.8, Win7, JDK8, Spark 2. Have these on the classpath: Cassandra core 3.12, spark-cassandra-2.11, spark-cassandra-java-2.11, Spark 2.11, spark-network-common_2.11, guava-16.0.jar, scala-2.11.jar, etc. Trying to run a basic example: it compiles fine, but when I try to run it, I get an error at the very first line:

SparkConf conf = new SparkConf();

java.lang.NoClassDefFoundError: org/spark_project/guava/cache/CacheLoader

A missing spark-network-common is supposed to cause this error, but I
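A lower-friction alternative to assembling all of these jars by hand is letting spark-submit resolve the connector and its transitive dependencies; a sketch, where the main class, application jar, and connector version are placeholders:

spark-submit \
  --class com.example.Main \
  --packages com.datastax.spark:spark-cassandra-connector_2.11:2.0.2 \
  target/my-app.jar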

Saving stream data into Cassandra tables named after Kafka topics

Submitted by ◇◆丶佛笑我妖孽 on 2019-12-11 06:05:46
Question: I have streaming data from a few Kafka topics, and I want to save each line of the RDD into a particular Cassandra table. My RDD is a collection of a case class named Stock:

Stock(test1,2017/07/23 00:01:02,14,Status)
Stock(test1,2017/07/23 00:01:03,78,Status)
Stock(test2,2017/07/23 00:01:02,86,Status)
Stock(test2,2017/07/23 00:01:03,69,Status)
Stock(test3,2017/07/23 00:01:02,46,Status)
Stock(test3,2017/07/23 00:01:03,20,Status)

I want to get the first element of each line in this RDD, which represent
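One way to do that routing (a sketch, not the poster's code) is to filter the stream's RDD once per topic and save each slice to the table of the same name. It assumes a DStream[Stock] called stockStream, a keyspace named my_keyspace, tables named after the topics, and Cassandra columns matching the case-class field names:

import com.datastax.spark.connector._

case class Stock(name: String, time: String, value: Int, status: String)

val topics = Seq("test1", "test2", "test3")

stockStream.foreachRDD { rdd =>
  rdd.cache() // the same RDD is scanned once per topic
  topics.foreach { topic =>
    rdd.filter(_.name == topic)               // rows whose first element is this topic
       .saveToCassandra("my_keyspace", topic) // table name == topic name
  }
  rdd.unpersist()
}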

How to use method bulkSaveToCassandra with spark-cassandra-connector

Submitted by 六月ゝ 毕业季﹏ on 2019-12-11 04:43:58
Question: I'm trying to use the method bulkSaveToCassandra with the spark-cassandra-connector to optimize my insertions into a Cassandra database. However, I can't find the method and I don't know how to import the library. Currently, I'm using this dependency:

<dependency>
  <groupId>com.datastax.spark</groupId>
  <artifactId>spark-cassandra-connector_2.11</artifactId>
  <version>2.0.2</version>
</dependency>

Below is the reference for the method bulkSaveToCassandra from DataStax: http://docs.datastax.com/en/datastax
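For what it's worth, bulkSaveToCassandra comes from DataStax Enterprise's bundled connector, not from the open-source artifact declared above; with the dependency shown, the available write path is saveToCassandra. A minimal sketch, with placeholder keyspace, table, and column names:

import com.datastax.spark.connector._

// Uses the open-source connector's standard write path.
rdd.saveToCassandra("my_keyspace", "my_table", SomeColumns("id", "value"))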

How to insert rows into Cassandra if they don't exist, using the spark-cassandra driver?

Submitted by 可紊 on 2019-12-11 04:14:21
Question: I want to write to Cassandra from a data frame, and I want to exclude rows that already exist (i.e. match on the primary key; even though upserts happen, I don't want to change the other columns), using the spark-cassandra-connector. Is there a way to do that? Thanks!

Answer 1: You can use the ifNotExists WriteConf option, which was introduced in this PR. It works like so:

val writeConf = WriteConf(ifNotExists = true)
rdd.saveToCassandra(keyspaceName, tableName, writeConf = writeConf)

Answer 2: You
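Spelled out with the imports it needs (the keyspace and table names here are placeholders):

import com.datastax.spark.connector._
import com.datastax.spark.connector.writer.WriteConf

val writeConf = WriteConf(ifNotExists = true)
// Rows whose primary key already exists are left untouched;
// only genuinely new rows are inserted.
rdd.saveToCassandra("my_keyspace", "my_table", writeConf = writeConf)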

Cannot connect to Cassandra from Spark

Submitted by 流过昼夜 on 2019-12-10 22:37:57
Question: I have some test data in my Cassandra. I am trying to fetch this data from Spark, but I get an error like:

py4j.protocol.Py4JJavaError: An error occurred while calling o25.load.
java.io.IOException: Failed to open native connection to Cassandra at {127.0.1.1}:9042

This is what I've done so far:

- started ./bin/cassandra
- created test data using CQL with keyspace="testkeyspace2" and table="emp", plus some keys and corresponding values
- wrote standalone.py
- ran the following pyspark shell command.
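The {127.0.1.1} in that message is often the tell: on Debian-like systems /etc/hosts maps the hostname to 127.0.1.1, while Cassandra actually listens on the rpc_address configured in cassandra.yaml (commonly 127.0.0.1). Pointing the connector at the right address when launching the shell usually resolves this; the connector version below is an assumption:

./bin/pyspark \
  --packages com.datastax.spark:spark-cassandra-connector_2.11:2.0.2 \
  --conf spark.cassandra.connection.host=127.0.0.1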

Comparison between different methods of executing SQL queries on Cassandra column families using Spark

Submitted by 允我心安 on 2019-12-10 15:59:04
Question: As part of my project, I have to create a SQL query interface for a very large Cassandra dataset, so I have been looking at different methods for executing SQL queries on Cassandra column families using Spark. I have come up with three different methods, using Spark SQLContext with a statically defined schema:

// statically defined in the application
public static class TableTuple implements Serializable {
    private int id;
    private String line;

    TableTuple(int i, String l) {
        id = i;
        line = l;
    }
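As a point of comparison, the connector's Data Source integration infers the schema from the Cassandra table itself, so a statically defined class like TableTuple is not strictly required. A sketch, assuming Spark 2.x and placeholder keyspace/table names (the id and line columns mirror the fields above):

val df = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "my_keyspace", "table" -> "my_table"))
  .load()

// Register the inferred-schema DataFrame and query it with plain SQL.
df.createOrReplaceTempView("my_table")
spark.sql("SELECT id, line FROM my_table WHERE id > 100").show()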

Apache Spark fails to process a large Cassandra column family

Submitted by 北战南征 on 2019-12-10 09:55:47
Question: I am trying to use Apache Spark to process my large (~230k entries) Cassandra dataset, but I constantly run into different kinds of errors. However, I can successfully run applications on a dataset of ~200 entries. I have a Spark setup of 3 nodes, with 1 master and 2 workers; the 2 workers also have a Cassandra cluster installed, with the data indexed with a replication factor of 2. My 2 Spark workers show 2.4 and 2.8 GB of memory on the web interface, and I set spark.executor
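Two settings that commonly matter for this failure mode are the executor heap and the size of the Cassandra input splits (smaller splits mean more, lighter Spark partitions). A sketch; the values are illustrative rather than recommendations, and the split property name assumes a reasonably recent connector:

import org.apache.spark.SparkConf

val conf = new SparkConf(true)
  .set("spark.executor.memory", "2g")
  // Shrink each Spark partition read from Cassandra (connector default: 64 MB).
  .set("spark.cassandra.input.split.size_in_mb", "32")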

How to retrieve Metrics like Output Size and Records Written from Spark UI?

Submitted by 老子叫甜甜 on 2019-12-10 02:13:45
Question: How do I collect these metrics on a console (Spark shell or a spark-submit job) right after the task or job is done? We are using Spark to load data from MySQL into Cassandra, and it is quite huge (e.g. ~200 GB and 600M rows). When the task is done, we want to verify exactly how many rows Spark processed. We can get the number from the Spark UI, but how can we retrieve that number ("Output Records Written") from the Spark shell or in a spark-submit job?

Sample command to load from MySQL into Cassandra:

val
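One way to read the "Output Records Written" figure programmatically is a SparkListener that sums the output metrics of finished tasks; a sketch, assuming Spark 2.x and an existing SparkContext named sc:

import java.util.concurrent.atomic.AtomicLong
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

val recordsWritten = new AtomicLong(0L)

sc.addSparkListener(new SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    // taskMetrics can be null for some failed tasks, so guard it.
    Option(taskEnd.taskMetrics).foreach { m =>
      recordsWritten.addAndGet(m.outputMetrics.recordsWritten)
    }
  }
})

// ... run the load job, then:
println(s"Records written: ${recordsWritten.get}")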