SELECT DISTINCT cql ignores WHERE clause

Submitted by 被刻印的时光 ゝ on 2019-12-03 11:48:59

It happens that way because in Cassandra CQL, DISTINCT is designed to return only the partition (row) keys of your table (column family), which must be unique. Therefore, when used with DISTINCT, the WHERE clause can only operate on partition keys (which, in your case, isn't terribly useful). If you take the DISTINCT out, WHERE can then be used to evaluate the clustering (column) keys within each partition key (albeit with ALLOW FILTERING).

I feel compelled to mention that ALLOW FILTERING is not something you should be doing a whole lot of...and definitely not in production. If that query is one you need to run often (querying events for userids after a certain timestamp) then I would suggest partitioning your data by event_type instead:

PRIMARY KEY (event_type, "timestamp", userid)

Then you'll be able to run this query without ALLOW FILTERING.

SELECT userid FROM events WHERE event_type='toto' AND timestamp > '1970-01-17 09:07:17+0100'

Without knowing anything about your application or use case, that may or may not be useful to you. But consider it as an example, and as an indication that there may be a better way to build your model to satisfy your query pattern(s). Check out Patrick McFadin's article on time series data modeling for more ideas on how to model for this problem.

As explained by Aaron, when using the DISTINCT keyword, you can only filter by partition keys. The reason lies in the algorithm behind DISTINCT queries and the way Cassandra stores data on disk and in memory.

To understand this, I'll make an analogy:

Cassandra stores information in a way similar to a book index. If you are searching for a chapter called "My third chapter", you only have to look at the first level of the index, so you only need to do an iterative search in a relatively small set. However, if you are looking for a sub-chapter called "My fourth sub-chapter" belonging to "My second chapter", you will have to do 2 iterative searches in 2 different sets, both small, provided that the index has at least 2 levels. The deeper you need to go, the longer it may take and the more complex the index will need to be. (You may still be lucky and find it very fast if it is at the start of the index, but with this kind of algorithm you have to consider the mean and the worst-case scenario.)

Cassandra does something similar:

Keyspace -> Table -> Partition Key -> Clustering Key -> Column

The deeper you need to go, the more sets you need to have in memory and the longer it will take to find anything. The index used to execute DISTINCT queries may even contain sets only down to the partition key level, thus only allowing searches for partition keys.
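A toy model may make the hierarchy more concrete. The sketch below is not how Cassandra is implemented; it simply represents a table as a Python dict keyed by partition key, with each partition holding rows ordered by clustering key. It assumes userid is the partition key and the timestamp is the clustering key, and the names `table`, `distinct_partition_keys` and `userids_matching`, as well as the sample data, are hypothetical.

```python
# Toy model only: partition key -> list of (clustering key, column value) rows.
# Assumption: userid is the partition key, timestamp the clustering key.
table = {
    "alice": [(100, "toto"), (250, "login")],
    "bob": [(120, "login")],
    "carol": [(300, "toto")],
}


def distinct_partition_keys(table):
    # Only the top level is touched; no row inside any partition is read.
    return sorted(table)


def userids_matching(table, event_type, after):
    # Filtering on deeper levels means descending into every partition.
    return sorted(
        userid
        for userid, rows in table.items()
        if any(ts > after and ev == event_type for ts, ev in rows)
    )


print(distinct_partition_keys(table))
print(userids_matching(table, "toto", 200))
```

The first function never looks below the first level, which is roughly why a DISTINCT over partition keys can be cheap; the second has to visit the rows of every partition.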

You need to realise that searching for any chapter that has a sub-chapter called "My second sub-chapter" (which would be the analogy to your query) still requires a 2-level-deep index and 2 levels of iterative searches.

If they decide to support DISTINCT on clustering keys, then your query would be fine. Meanwhile, you will have to filter them in the application, probably by using a built-in type called set, or something similar that handles repeated values by itself.
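As a minimal sketch of that client-side approach, assume the query was run without DISTINCT and returned rows as (userid, timestamp) pairs. The function name `distinct_userids` and the sample rows are hypothetical; with a real driver, the rows would come from the query result instead.

```python
def distinct_userids(rows):
    """Collect each userid once, however many rows it appears in."""
    seen = set()
    for userid, _timestamp in rows:
        seen.add(userid)  # adding an existing value is a no-op
    return seen


# Hypothetical result of the non-DISTINCT query: userids may repeat.
rows = [
    ("alice", "1970-01-17 09:10:00+0100"),
    ("bob", "1970-01-17 09:12:00+0100"),
    ("alice", "1970-01-17 09:15:00+0100"),
]

print(sorted(distinct_userids(rows)))
```

The set takes care of the duplicates, at the cost of transferring every matching row to the client first.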

Neither the solution proposed by Aaron (using the userid as a clustering key after the timestamp) nor this one (filtering on the client side) uses the fast DISTINCT mechanism. His proposal doesn't require client-side filtering, as it already handles that for you, but it has two main drawbacks: it isn't backwards compatible, as you will have to recreate the table, and it uses a constant partition key, which prevents Cassandra from distributing this data among its nodes. Remember that all rows with the same partition key are stored on the same node.
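The distribution drawback can be illustrated with a toy placement function. This is only a sketch: real Cassandra assigns partitions through a token ring (using the Murmur3 partitioner by default), not the modulo formula below, and `node_for`, the node count and the sample events are all hypothetical.

```python
import hashlib


def node_for(partition_key, node_count):
    # Toy placement: hash the partition key, then take it modulo the node count.
    # Not Cassandra's actual algorithm; it only shows that equal keys collide.
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % node_count


events = [("alice", "toto"), ("bob", "toto"), ("carol", "toto"), ("dave", "toto")]

# Partitioning by event_type: every row shares one key, hence one node.
by_event_type = {node_for(event_type, 3) for _, event_type in events}

# Partitioning by userid: different keys may land on different nodes.
by_userid = {node_for(userid, 3) for userid, _ in events}

print(len(by_event_type))  # 1: a single node holds every 'toto' event
print(sorted(by_userid))
```

Whatever the hash function, rows that share a partition key always map to the same place, so a table where most rows have the same event_type concentrates its load on one node and its replicas.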
