DataFrame where clause doesn't work when using the Spark Cassandra Connector

Submitted by 旧巷老猫 on 2019-12-07 04:06:28

API Conflicts

DataFrames do not use the Spark Cassandra Connector API, so calling where on a DataFrame actually invokes a Catalyst expression. The predicate is not passed down to the underlying CQL layer; it is applied inside Spark itself. Spark doesn't know what "maxtimeuuid" is, so the query fails.

Filters rows using the given SQL expression.

See http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrame

Since this predicate is invalid to Catalyst it will never reach the connector, so a clause like this cannot be processed at the data source level.
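To make the failure mode concrete, here is a sketch of the failing DataFrame call. The keyspace "ks", table "events", and timeuuid column "t" are hypothetical, and it assumes a live SparkSession `spark` with the connector on the classpath:

```scala
// Hypothetical keyspace "ks" with table "events" containing a timeuuid column "t".
val df = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "ks", "table" -> "events"))
  .load()

// The string is parsed by Catalyst as a Spark SQL expression, not sent to
// Cassandra as CQL, so the unknown function causes an AnalysisException:
df.where("t > maxTimeuuid('2019-01-01 00:00:00')")
```

Nothing about this predicate ever reaches Cassandra; the error is raised during Spark's own analysis phase.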

Only the Spark Cassandra Connector's RDD.where clause passes the predicate directly through as CQL to the underlying query.

Adds a CQL WHERE predicate(s) to the query. Useful for leveraging secondary indexes in Cassandra. Implicitly adds an ALLOW FILTERING clause to the WHERE clause, however beware that some predicates might be rejected by Cassandra, particularly in cases when they filter on an unindexed, non-clustering column.

http://datastax.github.io/spark-cassandra-connector/ApiDocs/1.6.0-M1/spark-cassandra-connector/#com.datastax.spark.connector.rdd.CassandraRDD
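By contrast, the RDD API described above hands the clause to Cassandra verbatim. A sketch, again using the hypothetical keyspace "ks", table "events", and timeuuid column "t", and assuming a SparkContext `sc` with the connector available:

```scala
import com.datastax.spark.connector._

// RDD.where appends the string to the generated CQL, so Cassandra itself
// evaluates maxTimeuuid server-side (subject to the caveats above about
// ALLOW FILTERING and unindexed, non-clustering columns):
val rdd = sc.cassandraTable("ks", "events")
  .where("t > maxTimeuuid('2019-01-01 00:00:00')")
```

Because the predicate is evaluated by Cassandra, CQL-only functions like maxTimeuuid work here even though Catalyst has never heard of them.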

Dataframes and TimeUUID

Comparing TimeUUIDs through DataFrames is going to be difficult because Catalyst has no notion of a TimeUUID type, so the connector reads them (through DataFrames) as strings. This is a problem because TimeUUID strings are not lexically comparable, so you won't get the right answer even if you generate the TimeUUID yourself and compare against it directly instead of calling a function.
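The reason string comparison fails is the field layout of a version-1 UUID: the string begins with time_low, the low 32 bits of the timestamp, so a later TimeUUID can sort lexically earlier. A self-contained demonstration with two synthetic version-1 UUIDs:

```scala
import java.util.UUID

object TimeUuidOrdering extends App {
  // Version-1 UUID layout: time_low-time_mid-time_hi_and_version-clock_seq-node.
  // The LOW 32 bits of the 60-bit timestamp come first in the string,
  // so string order does not follow time order.
  val a = UUID.fromString("ffffffff-0000-1000-8000-000000000001")
  val b = UUID.fromString("00000000-0001-1000-8000-000000000001")

  // b carries the larger (later) 100-ns timestamp ...
  assert(b.timestamp() > a.timestamp())
  // ... yet its string representation sorts lexically earlier.
  assert(b.toString < a.toString)
  println("later TimeUUID sorts lexically earlier; string comparison is unsafe")
}
```

This is why filtering a string-typed TimeUUID column in a DataFrame gives wrong results even with a correctly generated comparison value.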
