How can you pushdown predicates to Cassandra or limit requested data when using Pyspark / Dataframes?

Submitted by 眉间皱痕 on 2019-12-25 04:23:20

Question


For example, on docs.datastax.com they mention:

table1 = sqlContext.read.format("org.apache.spark.sql.cassandra").options(table="kv", keyspace="ks").load()

and it's the only way I know, but let's say I want to load only the last one million entries from this table. I don't want to load the whole table into memory every time, especially if the table has, for example, over 10 million entries.

Thanks!


Answer 1:


While you can't load data faster, you can load only portions of the data or terminate early. Spark DataFrames use Catalyst to optimize their underlying query plans, which enables them to take some shortcuts.

For example, calling limit allows Spark to skip reading some portions of the underlying DataSource. It limits the amount of data read from Cassandra by cancelling tasks before they are executed.
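
For instance, here is a minimal PySpark sketch of that idea, reusing the table1 DataFrame from the question (one million is just the figure the question mentions, and limit alone does not control which rows you get):

# Sketch only: table1 is the Cassandra-backed DataFrame loaded in the question.
some_million = table1.limit(1000000)  # request at most one million rows
some_million.show(10)                 # an action that only runs as many tasks as it needs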

Calling filter, or adding filters, lets the underlying DataSource restrict the amount of information actually pulled from Cassandra. There are limitations on what can be pushed down, but this is all detailed in the documentation:

https://github.com/datastax/spark-cassandra-connector/blob/master/doc/14_data_frames.md#pushing-down-clauses-to-cassandra

Note that all of this is accomplished simply by making further API calls on your DataFrame once you have loaded it. For example:

df = (sqlContext
  .read
  .format("org.apache.spark.sql.cassandra")
  .options(table="kv", keyspace="ks")
  .load())

df.show(10)  # Will compute only enough tasks to get 10 records and no more
df.filter(df.clusteringKey > 5).show()  # Will push the clustering-key predicate down to C*
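
One way to verify what was actually pushed down (a quick sketch, not part of the original answer) is to print the physical plan and look for the predicate under PushedFilters:

# Assumes the df defined above; "clusteringKey" is the same illustrative column name.
df.filter(df.clusteringKey > 5).explain()
# If the connector could push the predicate down to Cassandra, the scan node in the
# physical plan output should list it under "PushedFilters".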


Source: https://stackoverflow.com/questions/40987667/how-can-you-pushdown-predicates-to-cassandra-or-limit-requested-data-when-using
