spark-cassandra-connector

How to fix 'ClassCastException: cannot assign instance of' - Works locally but not in standalone mode on a cluster

Submitted by 匆匆过客 on 2019-12-25 01:32:15
Question: I have a Spring web application (built with Maven) that connects to my Spark cluster (4 workers and 1 master) and to my Cassandra cluster (4 nodes). The application starts, the workers communicate with the master, and the Cassandra cluster is also running. However, when I run a PCA (Spark MLlib) or any other calculation (clustering, Pearson, Spearman) through my web app's interface, I get the following error: java.lang.ClassCastException: cannot assign instance of scala.collection.immutable
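A frequent cause of this kind of ClassCastException appearing only on a cluster is that the classes for the driver's closures are not on the executors' classpath, so they are deserialized against a different classloader. Below is a minimal sketch of one common remedy, assuming the web app is packaged as a single assembly jar; the jar path, app name, and master URL are placeholders, not values from the question.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hedged sketch: ship the application's assembly jar to the executors so that the
// anonymous-function classes compiled into it can be deserialized on the workers.
// The path and master URL below are placeholders.
val conf = new SparkConf()
  .setAppName("SpringWebApp")
  .setMaster("spark://master-host:7077")
  .setJars(Seq("/path/to/webapp-assembly.jar"))

val sc = new SparkContext(conf)
```

The same effect can usually be had by passing the jar through spark.jars or spark-submit instead of hard-coding it in the application.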

java.lang.NoSuchMethodError: scala.reflect.api.JavaUniverse.runtimeMirror(Ljava/lang/ClassLoader;): Spark Cassandra connector

Submitted by 99封情书 on 2019-12-24 20:09:41
Question: I have been using Spark 1.6 and Cassandra connector 1.4.3 to write data to Cassandra from Spark. Now we have upgraded to Spark 2.1.0, and I tried to upgrade the Cassandra connector to 2.0.0-M3, but it returns the error below: java.lang.NoSuchMethodError: scala.reflect.api.JavaUniverse.runtimeMirror(Ljava/lang/ClassLoader;)Lscala/reflect/api/JavaMirrors$JavaMirror; at com.datastax.spark.connector.util.JavaApiHelper$.mirror(JavaApiHelper.scala:25) at com.datastax.spark.connector.util
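This particular NoSuchMethodError is the classic symptom of mixed Scala binary versions on the classpath (Spark 2.1.0 distributions are built against Scala 2.11, while a leftover _2.10 artifact still resolves to the Scala 2.10 reflection API). A hedged build.sbt sketch, assuming an sbt build, that keeps every artifact on the same Scala binary version:

```scala
// build.sbt sketch (assumption: the project is built with sbt). Using %% and a single
// scalaVersion keeps all artifacts on the same Scala binary version, which is what the
// runtimeMirror signature mismatch usually points to.
scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "org.apache.spark"   %% "spark-core"                 % "2.1.0" % "provided",
  "org.apache.spark"   %% "spark-sql"                  % "2.1.0" % "provided",
  "com.datastax.spark" %% "spark-cassandra-connector"  % "2.0.0-M3"
)
```

Checking the resolved dependency tree for any stray _2.10 artifact is usually enough to confirm the diagnosis.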

Task had a not serializable result in Spark

Submitted by 拥有回忆 on 2019-12-24 17:12:42
Question: I am trying to read a Cassandra table from Spark using the Cassandra driver. Here is the code: val x = 1 to 2 val rdd = sc.parallelize(x) val query = "Select data from testkeyspace.testtable where id=%d" val cc = CassandraConnector(sc.getConf) val res1 = rdd.map{ it => cc.withSessionDo{ session => session.execute( query.format(it)) } } res1.take(1).foreach(println) But I am getting the exception "Task had a not serializable result": org.apache.spark.SparkException: Job aborted due to stage
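The map in the question returns the driver's ResultSet objects as task results, and those are not serializable. One hedged fix, based on the question's own query and assuming the data column holds text, is to extract plain values inside withSessionDo so that only an ordinary Scala collection travels back to the driver:

```scala
import scala.collection.JavaConverters._
import com.datastax.spark.connector.cql.CassandraConnector

// Sketch: convert the driver's Row objects to plain Strings inside withSessionDo,
// so the task result is a serializable List[String] rather than a ResultSet.
// Assumes the "data" column is text, as suggested by the question's query.
val cc = CassandraConnector(sc.getConf)
val res1 = sc.parallelize(1 to 2).map { id =>
  cc.withSessionDo { session =>
    val rs = session.execute(s"SELECT data FROM testkeyspace.testtable WHERE id = $id")
    rs.all().asScala.map(_.getString("data")).toList
  }
}
res1.take(1).foreach(println)
```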

java.lang.IllegalArgumentException: requirement failed: Columns not found in Double

Submitted by 纵然是瞬间 on 2019-12-24 08:46:56
Question: I am working in Spark. I have many CSV files that contain lines; a line looks like this: 2017,16,16,51,1,1,4,-79.6,-101.90,-98.900 A line can contain more or fewer fields, depending on the CSV file. Each file corresponds to a Cassandra table into which I need to insert all the lines the file contains, so what I basically do is take the line, split its elements, and put them in a List[Double] sc.stop import com.datastax.spark.connector._, org.apache.spark.SparkContext, org.apache.spark.SparkContext._, org
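The "Columns not found in Double" message typically means the connector was asked to write a bare collection of Doubles, which it cannot map to named columns. A hedged sketch, with placeholder keyspace, table, and column names (the real schema differs per CSV file), mapping each parsed line to a case class whose fields line up with the target table:

```scala
import com.datastax.spark.connector._

// Sketch: give each parsed line named fields matching the table's columns instead
// of leaving it as a List[Double]. Keyspace/table/column names are placeholders.
case class Measurement(year: Double, f1: Double, f2: Double, f3: Double)

val rows = sc.textFile("/path/to/file.csv").map { line =>
  val f = line.split(",").map(_.toDouble)
  Measurement(f(0), f(1), f(2), f(3))
}

rows.saveToCassandra("my_keyspace", "my_table", SomeColumns("year", "f1", "f2", "f3"))
```

An RDD of tuples works equally well, as long as SomeColumns lists the destination columns in the same order as the tuple fields.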

Spark Streaming: filtering the streaming data

Submitted by 余生颓废 on 2019-12-24 06:48:44
Question: I am trying to filter streaming data and, based on the value of the id column, save the data to different tables. I have two tables: testTable_odd (id, data1, data2) and testTable_even (id, data1). If the id value is odd, I want to save the record to the testTable_odd table, and if it is even, to testTable_even. The tricky part here is that my two tables have different columns. I have tried multiple ways and considered Scala functions with return type Either[obj1,obj2], but I
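One hedged way to avoid an Either return type is to filter the same micro-batch twice and let each branch write only the columns its table has. A minimal sketch, assuming a DStream of (id, data1, data2) records; the keyspace name is a placeholder, while the table and column names come from the question:

```scala
import com.datastax.spark.connector._
import org.apache.spark.streaming.dstream.DStream

// Sketch: split each micro-batch by parity of id and write each branch with its
// own column list. "my_ks" is a placeholder keyspace.
case class Record(id: Long, data1: String, data2: String)

def routeByParity(stream: DStream[Record]): Unit =
  stream.foreachRDD { rdd =>
    rdd.filter(_.id % 2 != 0)
       .saveToCassandra("my_ks", "testTable_odd", SomeColumns("id", "data1", "data2"))
    rdd.filter(_.id % 2 == 0)
       .map(r => (r.id, r.data1))
       .saveToCassandra("my_ks", "testTable_even", SomeColumns("id", "data1"))
  }
```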

Spark 1.5.1 + Scala 2.10 + Kafka + Cassandra = java.lang.NoSuchMethodError:

Submitted by 时光毁灭记忆、已成空白 on 2019-12-23 01:55:10
Question: I want to connect Kafka + Cassandra to Spark 1.5.1. The library versions are: scalaVersion := "2.10.6" libraryDependencies ++= Seq( "org.apache.spark" % "spark-streaming_2.10" % "1.5.1", "org.apache.spark" % "spark-streaming-kafka_2.10" % "1.5.1", "com.datastax.spark" % "spark-cassandra-connector_2.10" % "1.5.0-M2" ) Initialization and use in the app: val sparkConf = new SparkConf(true) .setMaster("local[2]") .setAppName("KafkaStreamToCassandraApp") .set("spark.executor.memory",
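NoSuchMethodError in this combination almost always comes from two different versions of the same library meeting at runtime rather than from the code itself. A hedged build.sbt sketch, assuming an sbt build, that keeps everything on one Scala binary version and marks Spark artifacts as provided so the cluster's own Spark jars win at runtime:

```scala
// build.sbt sketch (assumption: sbt build). Same Scala binary version everywhere,
// Spark marked "provided" so the submitted jar does not carry a second Spark copy.
scalaVersion := "2.10.6"

libraryDependencies ++= Seq(
  "org.apache.spark"   %% "spark-streaming"           % "1.5.1" % "provided",
  "org.apache.spark"   %% "spark-streaming-kafka"     % "1.5.1",
  "com.datastax.spark" %% "spark-cassandra-connector" % "1.5.0-M2"
)
```

Inspecting the resolved dependency tree (for example with sbt's evicted report, if available in your sbt version) usually reveals which transitive artifact, often Guava or the Cassandra driver, was pulled in at an incompatible version.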

Setting number of Spark tasks on a Cassandra table scan

Submitted by 随声附和 on 2019-12-22 16:57:15
Question: I have a simple Spark job reading 500M rows from a 5-node Cassandra cluster that always runs 6 tasks, which causes write issues due to the size of each task. I have tried adjusting the input_split_size, which seems to have no effect. At the moment I am forced to repartition the table scan, which is not ideal as it's expensive. Having read a few posts, I tried to increase num-executors in my launch script (below), although this had no effect. If there is no way to set the number of
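The task count of a table scan follows from how the connector splits the token ring, not from num-executors. A hedged sketch of the setting that usually controls it in recent connector versions, spark.cassandra.input.split.size_in_mb (the approximate amount of data per Spark partition); the value 16 and the keyspace/table names are placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

// Sketch: lower the per-partition split size so the scan is cut into more,
// smaller tasks. Must be set on the conf before the RDD is created.
val conf = new SparkConf()
  .setAppName("CassandraScan")
  .set("spark.cassandra.input.split.size_in_mb", "16")

val sc = new SparkContext(conf)
val scan = sc.cassandraTable("my_ks", "my_table")
println(scan.partitions.length)   // more partitions means more tasks
```

If the connector underestimates the table size (which can happen when its size estimates are stale), the split count can still come out low regardless of this setting.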

Saving data back into Cassandra as RDD

Submitted by 為{幸葍}努か on 2019-12-22 12:46:09
Question: I am trying to read messages from Kafka, process the data, and then add the data into Cassandra as if it were an RDD. My trouble is saving the data back into Cassandra. from __future__ import print_function from pyspark.streaming import StreamingContext from pyspark.streaming.kafka import KafkaUtils from pyspark import SparkConf, SparkContext appName = 'Kafka_Cassandra_Test' kafkaBrokers = '1.2.3.4:9092' topic = 'test' cassandraHosts = '1,2,3' sparkMaster = 'spark://mysparkmaster:7077' if _
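The question uses PySpark, where the DataStax connector does not expose an RDD-level save; for reference, this is a hedged Scala-side sketch of the save path the connector does provide, with placeholder keyspace, table, and column names rather than the question's real schema:

```scala
import com.datastax.spark.connector._
import org.apache.spark.streaming.dstream.DStream

// Scala-side sketch (the question itself is PySpark): each micro-batch of a DStream
// is written with saveToCassandra. Names below are placeholders.
case class Message(key: String, value: String)

def saveStream(stream: DStream[Message]): Unit =
  stream.foreachRDD { rdd =>
    rdd.saveToCassandra("my_ks", "messages", SomeColumns("key", "value"))
  }
```

From PySpark itself, the usual routes at the time were the connector's DataFrame source (format "org.apache.spark.sql.cassandra") or a third-party package such as pyspark-cassandra.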

Scala Spark Filter RDD using Cassandra

Submitted by 别来无恙 on 2019-12-22 01:12:04
Question: I am new to Spark-Cassandra and Scala. I have an existing RDD, let's say ((url_hash, url, created_timestamp)). I want to filter this RDD based on url_hash: if url_hash exists in the Cassandra table, I want to filter it out of the RDD so I can process only the new URLs. The Cassandra table looks like the following: url_hash | url | created_timestamp | updated_timestamp Any pointers would be great. I tried something like this: case class UrlInfoT(url_sha256: String, full_url: String,
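One hedged approach is to join the RDD's keys against the Cassandra table and subtract whatever comes back. The sketch below uses placeholder keyspace/table names and assumes url_hash is the table's partition key and that the question's RDD is a (String, String, Long) triple:

```scala
import com.datastax.spark.connector._
import org.apache.spark.rdd.RDD

// The question's existing RDD of (url_hash, url, created_timestamp); type assumed here.
val rdd: RDD[(String, String, Long)] = ???

// Sketch: joinWithCassandraTable returns only the keys that already exist in the
// table, so subtracting them leaves just the new URLs. "my_ks"/"urls" are placeholders.
val existing: RDD[(String, Unit)] = rdd
  .map { case (hash, _, _) => Tuple1(hash) }
  .joinWithCassandraTable("my_ks", "urls")
  .on(SomeColumns("url_hash"))
  .map { case (Tuple1(hash), _) => (hash, ()) }

val newUrls = rdd
  .keyBy { case (hash, _, _) => hash }
  .subtractByKey(existing)
  .values
```

Compared with collecting all existing hashes to the driver, the join only reads the partitions whose keys actually appear in the RDD.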

Spark Cassandra connector filtering with IN clause

Submitted by 北战南征 on 2019-12-21 20:47:54
Question: I am facing some issues with Spark Cassandra connector filtering for Java. Cassandra allows filtering by the last column of the partition key with an IN clause, e.g. create table cf_text (a varchar, b varchar, c varchar, primary key((a,b),c)) Query: select * from cf_text where a ='asdf' and b in ('af','sd'); sc.cassandraTable("test", "cf_text").where("a = ?", "af").toArray.foreach(println) How can I specify the IN clause used in the CQL query in Spark? How can range queries be specified
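A hedged sketch of two directions that are commonly used here. Because a full table scan's where() clause is appended to the connector's own token-range restriction, restricting partition key columns (a and b in this schema) is usually expressed instead as a join against an RDD of the wanted keys; clustering-column predicates such as an IN on c can be pushed down with where(). The key values below are taken from the question's CQL, the IN values for c are placeholders:

```scala
import com.datastax.spark.connector._

// Sketch 1: "a = 'asdf' AND b IN ('af','sd')" expressed as a join on the partition key.
val wantedKeys = sc.parallelize(Seq(("asdf", "af"), ("asdf", "sd")))   // (a, b) pairs

val rows = wantedKeys
  .joinWithCassandraTable("test", "cf_text")
  .map { case (_, row) => row }
rows.collect().foreach(println)

// Sketch 2: an IN predicate on the clustering column c, pushed down via where().
val byC = sc.cassandraTable("test", "cf_text")
  .where("c in ?", Seq("x", "y"))   // placeholder values
```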