spark-cassandra-connector

Setting number of Spark tasks on a Cassandra table scan

Submitted by 寵の児 on 2019-12-06 05:15:54
I have a simple Spark job reading 500 million rows from a 5-node Cassandra cluster that always runs 6 tasks, which is causing write issues due to the size of each task. I have tried adjusting input_split_size, which seems to have no effect. At the moment I am forced to repartition the table scan, which is not ideal as it is expensive. Having read a few posts, I tried increasing num-executors in my launch script (below), although this had no effect. If there is no way to set the number of tasks on a Cassandra table scan, that's fine, I'll make do, but I have this constant niggling feeling that …
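
With the spark-cassandra-connector, the number of scan tasks is normally driven by the input split size rather than by num-executors, so lowering spark.cassandra.input.split.size_in_mb is the usual lever for getting more, smaller tasks. A minimal sketch using the connector's RDD API; the host, keyspace and table names, and the 64 MB value are illustrative rather than taken from the question:

    import org.apache.spark.{SparkConf, SparkContext}
    import com.datastax.spark.connector._

    // Smaller splits => more tasks per table scan (64 MB is only an example value).
    val conf = new SparkConf()
      .setAppName("cassandra-scan")
      .set("spark.cassandra.connection.host", "10.0.0.1")
      .set("spark.cassandra.input.split.size_in_mb", "64")

    val sc  = new SparkContext(conf)
    val rdd = sc.cassandraTable("my_keyspace", "my_table")
    println(rdd.partitions.length)  // should grow as the split size shrinks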

SSL between Kafka and Spark

Submitted by 拈花ヽ惹草 on 2019-12-06 03:56:24
We are using Kafka, Spark Streaming, and loading data to Cassandra. We need to implement a security layer between the nodes running Kafka and the nodes running Spark. Any guidance on how to implement SSL between the Kafka and Spark nodes? Thanks, Sreeni. Source: https://stackoverflow.com/questions/37743490/ssl-between-kafka-and-spark
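
One common approach, sketched below, is to enable SSL on the Kafka brokers and hand the SSL consumer properties to Spark's direct Kafka stream. The spark-streaming-kafka-0-10 integration is assumed, and the broker address, topic, keystore paths and passwords are placeholders:

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

    // ssc is an existing StreamingContext; all SSL paths/passwords are placeholders.
    val kafkaParams = Map[String, Object](
      "bootstrap.servers"       -> "broker1:9093",
      "key.deserializer"        -> classOf[StringDeserializer],
      "value.deserializer"      -> classOf[StringDeserializer],
      "group.id"                -> "spark-consumer",
      "security.protocol"       -> "SSL",
      "ssl.truststore.location" -> "/etc/kafka/ssl/client.truststore.jks",
      "ssl.truststore.password" -> "changeit",
      "ssl.keystore.location"   -> "/etc/kafka/ssl/client.keystore.jks",
      "ssl.keystore.password"   -> "changeit"
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("events"), kafkaParams))

This covers the Kafka-to-Spark hop; traffic from Spark to Cassandra is configured separately via the connector's own spark.cassandra.connection.ssl.* settings.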

Apache Spark fails to process a large Cassandra column family

Submitted by 五迷三道 on 2019-12-06 03:03:59
I am trying to use Apache Spark to process my large (~230k entries) Cassandra dataset, but I constantly run into different kinds of errors. However, I can successfully run applications on a dataset of ~200 entries. I have a Spark setup of 3 nodes with 1 master and 2 workers, and the 2 workers also host a Cassandra cluster with the data indexed at a replication factor of 2. My 2 Spark workers show 2.4 and 2.8 GB of memory on the web interface, and I set spark.executor.memory to 2409 when running an application, to get a combined memory of 4.7 GB. Here is my WebUI homepage …
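
When executors run out of memory on a scan like this, the levers that usually help are smaller input splits and a smaller CQL fetch size, so each task streams less data at a time. A sketch with illustrative (not tuned) values and hypothetical keyspace/table names:

    import org.apache.spark.{SparkConf, SparkContext}
    import com.datastax.spark.connector._

    // Keep each read small enough to fit in executor memory; values are examples only.
    val conf = new SparkConf()
      .setAppName("large-cf-scan")
      .set("spark.executor.memory", "2g")
      .set("spark.cassandra.input.split.size_in_mb", "32")     // smaller splits per task
      .set("spark.cassandra.input.fetch.size_in_rows", "500")  // smaller pages per fetch

    val sc = new SparkContext(conf)
    println(sc.cassandraTable("my_keyspace", "my_table").count())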

Spark is executing every single action two times

Submitted by ≯℡__Kan透↙ on 2019-12-05 15:06:47
I have created a simple Java application that uses Apache Spark to retrieve data from Cassandra, do some transformations on it, and save it in another Cassandra table. I am using Apache Spark 1.4.1 configured in standalone cluster mode with a single master and slave, located on my machine. DataFrame customers = sqlContext.cassandraSql("SELECT email, first_name, last_name FROM customer " + "WHERE CAST(store_id as string) = '" + storeId + "'"); DataFrame customersWhoOrderedTheProduct = sqlContext.cassandraSql("SELECT email FROM customer_bought_product " + "WHERE CAST(store_id as string) = '" + …
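
When the same DataFrame feeds more than one action, Spark recomputes its whole lineage, including the Cassandra query, for each action unless the result is cached, so the "everything runs twice" symptom often comes down to a missing cache(). A simplified sketch in Scala (the question's code is Java); sqlContext is an existing CassandraSQLContext and the query is abbreviated:

    // Cache the intermediate result so two downstream actions reuse one computation.
    val customers = sqlContext.cassandraSql(
      "SELECT email, first_name, last_name FROM customer WHERE store_id = '123'")

    val cached = customers.cache()   // materialized on first use, reused afterwards
    val total  = cached.count()      // first action
    val sample = cached.take(10)     // second action: no second Cassandra scan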

scala.ScalaReflectionException: <none> is not a term

Submitted by 你说的曾经没有我的故事 on 2019-12-05 02:35:41
I have the following piece of code in Spark: rdd.map(processFunction(_)).saveToCassandra("keyspace", "tableName") where def processFunction(src: String): Seq[Any] = src match { case "a" => List(A("a", 123112, "b"), A("b", 142342, "c")) case "b" => List(B("d", 12312, "e", "f"), B("g", 12312, "h", "i")) } and where: case class A(entity: String, time: Long, value: String) case class B(entity: String, time: Long, value1: String, value2: String). saveToCassandra expects a collection of objects, and using Seq[Any] as the return type to hold both Seq[A] and Seq[B] breaks saveToCassandra with the …
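
The connector builds the Cassandra row writer from the static element type of the RDD, and an element type of Any gives the reflection machinery nothing to work with. One workaround, sketched below, is to keep the two case classes in separate RDDs so each saveToCassandra call sees a concrete type; rdd is the RDD[String] from the question and the table names are hypothetical:

    import com.datastax.spark.connector._

    // Split by case so each RDD has a concrete element type (A or B), not Any.
    val as = rdd.flatMap {
      case "a" => List(A("a", 123112, "b"), A("b", 142342, "c"))
      case _   => Nil
    }
    val bs = rdd.flatMap {
      case "b" => List(B("d", 12312, "e", "f"), B("g", 12312, "h", "i"))
      case _   => Nil
    }

    as.saveToCassandra("keyspace", "table_a")  // RDD[A]: row writer resolved statically
    bs.saveToCassandra("keyspace", "table_b")  // RDD[B]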

Cassandra Reading Benchmark with Spark

Submitted by 独自空忆成欢 on 2019-12-04 17:08:25
I'm doing a benchmark of Cassandra's read performance. In the test-setup step I created a cluster with 1 / 2 / 4 EC2 instances and data nodes. I wrote 1 table with 100 million entries (~3 GB CSV file). Then I launch a Spark application which reads the data into an RDD using the spark-cassandra-connector. I expected the following behavior: the more instances Cassandra uses (with the same number of instances on Spark), the faster the reads! With the writes everything seems correct (~2 times faster if the cluster is 2 times larger). But: in my benchmark the read is always faster with …
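
One thing worth verifying in a benchmark like this is that the scan is actually parallelized across the cluster and that the timing covers a full materializing action rather than a lazy transformation. A minimal sketch using the connector's RDD API; sc is an existing SparkContext and the keyspace/table names are placeholders:

    import com.datastax.spark.connector._

    // Report how many input partitions the connector created, then time a full scan.
    val rdd = sc.cassandraTable("benchmark_ks", "entries")
    println(s"input partitions: ${rdd.partitions.length}")

    val t0 = System.nanoTime()
    val n  = rdd.count()                      // forces the whole read, unlike a lazy map
    val dt = (System.nanoTime() - t0) / 1e9
    println(f"read $n%d rows in $dt%.1f s")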

Unable to serialize SparkContext in foreachRDD

Submitted by 南笙酒味 on 2019-12-04 15:23:52
I am trying to save streaming data from Kafka to Cassandra. I am able to read and parse the data, but when I call the lines below to save the data I get a Task not serializable exception. My class extends Serializable, but I am not sure why I am seeing this error; I didn't get much help even after googling for 3 hours. Can somebody give any pointers? val collection = sc.parallelize(Seq((obj.id, obj.data))) collection.saveToCassandra("testKS", "testTable ", SomeColumns("id", "data")) import org.apache.spark.SparkConf import org.apache.spark.SparkContext import org.apache.spark.sql …
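
The usual cause of this exception is that the closure passed to foreachRDD captures the driver-side SparkContext (here via sc.parallelize). With the connector you can save the stream directly, or work only with the RDD handed to foreachRDD, so sc never enters the closure. A sketch keeping the keyspace/table names from the question; dstream and the obj fields are assumed from the surrounding parsing code:

    import com.datastax.spark.connector._
    import com.datastax.spark.connector.streaming._

    // Option 1: save the DStream directly -- no foreachRDD, nothing extra to serialize.
    dstream.map(obj => (obj.id, obj.data))
      .saveToCassandra("testKS", "testTable", SomeColumns("id", "data"))

    // Option 2: inside foreachRDD, use only the rdd argument, never the outer sc.
    dstream.foreachRDD { rdd =>
      rdd.map(obj => (obj.id, obj.data))
        .saveToCassandra("testKS", "testTable", SomeColumns("id", "data"))
    }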

How to resolve Guava dependency issue while submitting Uber Jar to Google Dataproc

Submitted by 。_饼干妹妹 on 2019-12-04 05:25:45
Question: I am using the Maven Shade Plugin to build an uber jar for submitting as a job to a Google Dataproc cluster. Google has installed Apache Spark 2.0.2 and Apache Hadoop 2.7.3 on the cluster. Apache Spark 2.0.2 uses com.google.guava 14.0.1 and Apache Hadoop 2.7.3 uses 11.0.2; both should already be on the classpath. <plugin> <groupId>org.apache.maven.plugins</groupId> <artifactId>maven-shade-plugin</artifactId> <version>3.0.0</version> <executions> <execution> <phase>package</phase> <goals> …
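
The remedy generally suggested for this kind of clash is to relocate (shade) Guava inside the uber jar so the Guava version your dependencies need cannot collide with the one Hadoop already puts on the Dataproc classpath. The question uses maven-shade-plugin; as a sketch in sbt (the build tool used elsewhere in this digest), the analogous relocation with sbt-assembly looks like this:

    // build.sbt sketch: rename Guava packages inside the fat jar (sbt-assembly analogue
    // of a maven-shade <relocation>); the "shaded." prefix is an arbitrary choice.
    assemblyShadeRules in assembly := Seq(
      ShadeRule.rename("com.google.common.**" -> "shaded.com.google.common.@1").inAll
    )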

Apache Spark SQL is taking forever to count billion rows from Cassandra?

Submitted by 大憨熊 on 2019-12-04 05:04:51
Question: I have the following code. I invoke spark-shell as follows: ./spark-shell --conf spark.cassandra.connection.host=170.99.99.134 --executor-memory 15G --executor-cores 12 --conf spark.cassandra.input.split.size_in_mb=67108864 Code: scala> val df = spark.sql("SELECT test from hello") // Billion rows in hello and test column is 1KB df: org.apache.spark.sql.DataFrame = [test: binary] scala> df.count [Stage 0:> (0 + 2) / 13] // I don't know what these numbers mean precisely. If I invoke spark-shell as …
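
Worth noting when reading this excerpt: spark.cassandra.input.split.size_in_mb is already expressed in megabytes, so a value of 67108864 asks for splits of roughly 64 TB each and leaves the scan with almost no parallelism. A sketch with a conventional value (64 is illustrative, not tuned), keeping the host and query from the question:

    import org.apache.spark.sql.SparkSession

    // The split-size option is measured in MB; 67108864 would mean ~64 TB per split.
    val spark = SparkSession.builder()
      .appName("count-hello")
      .config("spark.cassandra.connection.host", "170.99.99.134")
      .config("spark.cassandra.input.split.size_in_mb", "64")
      .getOrCreate()

    val df = spark.sql("SELECT test FROM hello")  // assumes `hello` is registered as in the question
    println(df.count())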

sbt unresolved dependency for spark-cassandra-connector 2.0.2

Submitted by 拥有回忆 on 2019-12-02 14:59:52
Question: build.sbt: val sparkVersion = "2.1.1"; libraryDependencies += "org.apache.spark" %% "spark-core" % sparkVersion % "provided"; libraryDependencies += "org.apache.spark" %% "spark-sql" % sparkVersion % "provided"; libraryDependencies += "org.apache.spark" %% "spark-streaming" % sparkVersion % "provided"; libraryDependencies += "com.datastax.spark" % "spark-cassandra-connector" % "2.0.2"; libraryDependencies += "org.apache.spark" % "spark-streaming-kafka-0-10_2.11" % sparkVersion; output: [error …
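
The connector artifact is published per Scala version (e.g., spark-cassandra-connector_2.11), so one likely cause of the unresolved dependency is the single % on that line, which asks for an unsuffixed artifact name. A sketch of the adjusted build.sbt line:

    // Use %% so sbt appends the Scala binary version (_2.11) to the artifact name.
    libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "2.0.2"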