Question
Being a newbie to Apache Spark, I'm facing an issue fetching Cassandra data from Spark.
// `sc` (the Spark context), the mapped class `A`, and the column name map `colMap` are defined elsewhere.
List<String> dates = Arrays.asList("2015-01-21", "2015-01-22");
CassandraJavaRDD<A> aRDD = CassandraJavaUtil.javaFunctions(sc)
        .cassandraTable("testing", "cf_text", CassandraJavaUtil.mapRowTo(A.class, colMap))
        .where("Id=? and date IN ?", "Open", dates);
This query is not filtering the data on the Cassandra server. While this Java statement executes, memory usage shoots up until Spark finally throws a java.lang.OutOfMemoryError. The query should filter the data on the Cassandra server rather than on the client side, as described at https://github.com/datastax/spark-cassandra-connector/blob/master/doc/3_selection.md.
When I run the query with the filters in the Cassandra cqlsh it performs fine, while running the query without the filter (the WHERE clause) times out, which is expected. So it is clear that Spark is not applying the filters on the server side.
SparkConf conf = new SparkConf();
conf.setAppName("Test");
conf.setMaster("local[8]");
conf.set("spark.cassandra.connection.host", "192.168.1.15");
Why are the filters applied on the client side, and how can this be improved so that the filters are applied on the server side?
Also, how could we configure a Spark cluster on top of the Cassandra cluster on the Windows platform?
Answer 1:
I haven't used Cassandra with Spark, but from reading the section you linked (thanks for that) I see that:
Note: Although the ALLOW FILTERING clause is implicitly added to the generated CQL query, not all predicates are currently allowed by the Cassandra engine. This limitation is going to be addressed in the future Cassandra releases. Currently, ALLOW FILTERING works well with columns indexed by secondary indexes or clustering columns.
I'm pretty sure (but haven't tested) that the "IN" predicate is not supported; see https://github.com/datastax/spark-cassandra-connector/blob/24fbe6a10e083ddc3f770d1f52c07dfefeb7f59a/spark-cassandra-connector-java/src/main/java/com/datastax/spark/connector/japi/rdd/CassandraJavaRDD.java#L80
So you could try limiting your WHERE clause to Id (assuming there is a secondary index on that column) and use Spark-side filtering for the date range, as in the sketch below.
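A minimal sketch of that split, reusing `sc`, `A`, and `colMap` from the question; the `getDate()` accessor on `A` and the use of Java 8 lambdas are my assumptions:

import java.util.Arrays;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import com.datastax.spark.connector.japi.CassandraJavaUtil;
import com.datastax.spark.connector.japi.rdd.CassandraJavaRDD;

// Push only the indexed Id predicate down to Cassandra...
CassandraJavaRDD<A> openRDD = CassandraJavaUtil.javaFunctions(sc)
        .cassandraTable("testing", "cf_text", CassandraJavaUtil.mapRowTo(A.class, colMap))
        .where("Id = ?", "Open");

// ...and evaluate the date predicate in Spark after the rows arrive.
List<String> dates = Arrays.asList("2015-01-21", "2015-01-22");
JavaRDD<A> filtered = openRDD.filter(a -> dates.contains(a.getDate()));

This way only the rows matching Id are fetched from Cassandra, and the date check runs in Spark, avoiding the unsupported IN pushdown.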
Answer 2:
I'd suggest reading the table in as a DataFrame instead of an RDD; DataFrames are available in Spark 1.3 and higher. Then you can specify the CQL query as a string, like this:
// CassandraSQLContext comes from the connector's org.apache.spark.sql.cassandra package.
CassandraSQLContext sqlContext = new CassandraSQLContext(sc);
String query = "SELECT * FROM testing.cf_text where id='Open' and date IN ('2015-01-21','2015-01-22')";
DataFrame resultsFrame = sqlContext.sql(query);
System.out.println(resultsFrame.count());
So try that and see if it works better for you.
Once you have the data in a DataFrame, you can run Spark SQL operations on it. And if you want the data in an RDD, you can convert the DataFrame into an RDD.
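For instance, a short sketch of that conversion; the assumption that the id column sits at position 0 in the result is mine:

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Row;

// DataFrame -> RDD of generic Rows; each field can then be read by position or by name.
JavaRDD<Row> rows = resultsFrame.javaRDD();
JavaRDD<String> ids = rows.map(row -> row.getString(0)); // assumes column 0 holds the id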
Answer 3:
Setting spark.cassandra.input.split.size_in_mb in the SparkConf solved the issue.
SparkConf conf = new SparkConf();
conf.setAppName("Test");
conf.setMaster("local[4]");
conf.set("spark.cassandra.connection.host", "192.168.1.15")
    .set("spark.executor.memory", "2g")
    // 67108864 = 64 * 1024 * 1024, presumably 64 MB expressed in bytes,
    // compensating for the connector misreading this setting (see below).
    .set("spark.cassandra.input.split.size_in_mb", "67108864");
The spark-cassandra-connector reads the wrong value of spark.cassandra.input.split.size_in_mb, so overriding this value in the SparkConf does the trick. The IN clause is now working fine as well.
Source: https://stackoverflow.com/questions/31141998/why-apache-spark-is-performing-the-filters-on-client