Question
For example, on docs.datastax.com we mention:
table1 = sqlContext.read.format("org.apache.spark.sql.cassandra").options(table="kv", keyspace="ks").load()
and it's the only way I know of, but let's say that I want to load only the last one million entries from this table. I don't want to load the whole table into memory every time, especially if the table has, for example, over 10 million entries.
Thanks!
Answer 1:
While you can't load the data any faster, you can load only portions of it or terminate early. Spark DataFrames use Catalyst to optimize their underlying query plans, which enables them to take some shortcuts.
For example, calling limit allows Spark to skip reading some portions of the underlying data source. It limits the amount of data read from Cassandra by cancelling scan tasks before they execute.
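As a minimal sketch (assuming df is the Cassandra-backed DataFrame defined in the snippet further down, and that one million is the row count you want):

// Spark schedules only as many Cassandra scan tasks as it needs to
// produce the first million rows, rather than reading the whole table
val firstMillion = df.limit(1000000)
firstMillion.show() // an action such as show or take triggers the limited read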
Calling filter, or adding filters, can be used by the underlying data source to restrict the amount of information actually pulled from Cassandra. There are limitations on what can be pushed down, but this is all detailed in the documentation:
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/14_data_frames.md#pushing-down-clauses-to-cassandra
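To check whether a predicate was actually pushed down, you can inspect the physical plan: predicates the data source accepted are listed under PushedFilters in the scan node (a sketch, reusing the df defined below; clusteringKey is a stand-in for the table's real clustering column):

// A pushed-down predicate appears in the scan's "PushedFilters" list;
// anything Spark could not push shows up as a separate Filter step instead
df.filter(df("clusteringKey") > 5).explain()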
Note that all of this is accomplished simply by making further API calls on your DataFrame once you have loaded it. For example:
val df = sqlContext
  .read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "kv", "keyspace" -> "ks"))
  .load()
df.show(10) // Computes only enough tasks to get 10 records and no more
df.filter(df("clusteringKey") > 5).show() // Pushes the clustering predicate down to C*
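Putting the two together for the original question (again a sketch; the predicate value and row count are illustrative):

// Restrict the scan with a pushed-down clustering predicate, then cap
// how many of the matching rows Spark actually pulls back
df.filter(df("clusteringKey") > 5).limit(1000000).show()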
Source: https://stackoverflow.com/questions/40987667/how-can-you-pushdown-predicates-to-cassandra-or-limit-requested-data-when-using