How can you pushdown predicates to Cassandra or limit requested data when using Pyspark / Dataframes?

Submitted by 眉间皱痕 on 2019-12-25 04:23:20

Question


For example, on docs.datastax.com they mention:

table1 = sqlContext.read.format("org.apache.spark.sql.cassandra").options(table="kv", keyspace="ks").load()

and it's the only way I know, but let's say I want to load only the last one million entries from this table. I don't want to load the whole table into memory every time, especially if the table has, for example, over 10 million entries.

Thanks!


Answer 1:


While you can't load data faster, you can load only portions of the data or terminate early. Spark DataFrames use Catalyst to optimize their underlying query plans, which enables them to take some shortcuts.

For example, calling limit allows Spark to skip reading some portions of the underlying DataSource. It limits the amount of data read from Cassandra by cancelling tasks before they are executed.
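
For instance, here is a minimal PySpark sketch of that idea, reusing the table1 DataFrame from the question (one million is just the figure the question mentions, and limit alone does not control which rows you get):

# Sketch only: table1 is the Cassandra-backed DataFrame loaded in the question.
some_million = table1.limit(1000000)  # request at most one million rows
some_million.show(10)                 # an action that only runs as many tasks as it needs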

Calling filter, or adding filters, lets the underlying DataSource restrict the amount of information actually pulled from Cassandra. There are limitations on what can be pushed down, but this is all detailed in the documentation:

https://github.com/datastax/spark-cassandra-connector/blob/master/doc/14_data_frames.md#pushing-down-clauses-to-cassandra

Note that all of this is accomplished simply by making further API calls on your DataFrame once you have loaded it. For example:

df = (sqlContext
  .read
  .format("org.apache.spark.sql.cassandra")
  .options(table="kv", keyspace="ks")
  .load())

df.show(10)  # Will compute only enough tasks to get 10 records and no more
df.filter(df.clusteringKey > 5).show()  # Will push the clustering-key predicate down to C*
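
One way to verify what was actually pushed down (a quick sketch, not part of the original answer) is to print the physical plan and look for the predicate under PushedFilters:

# Assumes the df defined above; "clusteringKey" is the same illustrative column name.
df.filter(df.clusteringKey > 5).explain()
# If the connector could push the predicate down to Cassandra, the scan node in the
# physical plan output should list it under "PushedFilters".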


Source: https://stackoverflow.com/questions/40987667/how-can-you-pushdown-predicates-to-cassandra-or-limit-requested-data-when-using
