SELECT DISTINCT cql ignores WHERE clause

Executing two identical requests but the DISTINCT keyword gives unexpected results. Without the keyword, the result is ok but with DISTINCT, it looks like the where clause is ignored. Why ?

Cqlsh version:

Connected to Test Cluster at localhost:9160.
[cqlsh 4.1.1 | Cassandra 2.0.6 | CQL spec 3.1.1 | Thrift protocol 19.39.0]

Table considered:

DESCRIBE TABLE events;

CREATE TABLE events (
  userid uuid,
  "timestamp" timestamp,
  event_type text,
  data text,
  PRIMARY KEY (userid, "timestamp", event_type)
) WITH
  bloom_filter_fp_chance=0.010000 AND
  caching='KEYS_ONLY' AND
  comment='' AND
  dclocal_read_repair_chance=0.000000 AND
  gc_grace_seconds=864000 AND
  index_interval=128 AND
  read_repair_chance=0.100000 AND
  replicate_on_write='true' AND
  populate_io_cache_on_flush='false' AND
  default_time_to_live=0 AND
  speculative_retry='99.0PERCENTILE' AND
  memtable_flush_period_in_ms=0 AND
  compaction={'class': 'SizeTieredCompactionStrategy'} AND
  compression={'sstable_compression': 'LZ4Compressor'};

Table content:

SELECT * FROM events;

 userid                               | timestamp                | event_type | data
--------------------------------------+--------------------------+------------+------
 aaaaaaaa-be1c-44ab-a0e8-f25cf6064b0e | 1970-01-17 09:06:17+0100 |       toto | null
 4271a78f-be1c-44ab-a0e8-f25cf6064b0e | 1970-01-17 09:06:17+0100 |       toto | null
 4271a78f-be1c-44ab-a0e8-f25cf6064b0e | 1970-01-17 09:07:17+0100 |       toto | null
 4271a78f-be1c-44ab-a0e8-f25cf6064b0e | 1970-01-17 09:08:17+0100 |       toto | null
 4271a78f-be1c-44ab-a0e8-f25cf6064b0e | 1970-01-17 09:09:17+0100 |       toto | null
 4271a78f-be1c-44ab-a0e8-f25cf6064b0e | 1970-01-17 09:10:17+0100 |       toto | null

(6 rows)

Request1: Request without DISTINCT

SELECT userid FROM events WHERE timestamp > '1970-01-17 09:07:17+0100' ALLOW FILTERING;

 userid
--------------------------------------
 4271a78f-be1c-44ab-a0e8-f25cf6064b0e
 4271a78f-be1c-44ab-a0e8-f25cf6064b0e
 4271a78f-be1c-44ab-a0e8-f25cf6064b0e

(3 rows)

Request2: Same request with DISTINCT

SELECT DISTINCT userid FROM events WHERE timestamp > '1970-01-17 09:07:17+0100' ALLOW FILTERING;

 userid
--------------------------------------
 aaaaaaaa-be1c-44ab-a0e8-f25cf6064b0e
 4271a78f-be1c-44ab-a0e8-f25cf6064b0e

(2 rows)

EDIT 1
here is some context.
This table "events" is subject to a lot of writes, it receives around ~1k insertions per second and I have a batch script that checks those events every 5 minutes.
This batch script has 2 needs:
1- get all userids that have been active in the last 5 minutes (i.e every userid present in the events from the last 5 minutes)
2- get all events related to those userids (not only for the last 5 minutes)

I used to have two different tables to handle this. One table "activeusers" for the first request and the "events" table like I have described here for the second request. My problem with that is just that it requires from my server to write in two different tables when it receives an event. So I tried this using only the events table.

It happens that way because in Cassandra CQL DISTINCT is designed to return only the partition (row) keys of your table (column family)...which must be unique. Therefore, the WHERE clause can only operate on partition keys when used with DISTINCT (which in your case, isn't terribly useful). If you take the DISTINCT out, WHERE can then be used to evaluate the clustering (column) keys within each partition key (albeit, with ALLOW FILTERING).

I feel compelled to mention that ALLOW FILTERING is not something you should be doing a whole lot of...and definitely not in production. If that query is one you need to run often (querying events for userids after a certain timestamp) then I would suggest partitioning your data by event_type instead:

PRIMARY KEY (event_type, "timestamp", userid)

Then you'll be able to run this query without ALLOW FILTERING.

SELECT userid FROM events WHERE event_type='toto' AND timestamp > '1970-01-17 09:07:17+0100'

Without knowing anything about your application or use case, that may or may not be useful to you. But consider it as an example, and as an indication that there may be a better way build your model to satisfy your query pattern(s). Check out Patrick McFadin's article on timeseries data modeling for more ideas on how to model for this problem.

As explained by Aaron, when using the DISTINCT keyword, you can only filter by partition keys. The reason behind this is the algorithm behind DISTINCT queries and the way Cassandra stores the data into disk/memory.

To understand this, I'll make an analogy:

Cassandra stores the information similar to a book index. If you are searching a chapter called "My third chapter" you only have to look at the first level of the index for it, so you only need to do an iterative search in a relatively small set. However, if you are looking for a sub-chapter called "My fourth sub-chapter" belonging to "My second chapter" you will have to do 2 iterative searchs in 2 different sets, both small, provided that the index has at least 2 levels. The deeper you need to go the longer it may take (you still may be lucky and find it very fast if it is at the start of the index but in this kind of algorithms you have to test for the mean and the worst case scenario) and the more complex the index will need to be.

Cassandra does something similar: Keyspace -> Table -> Partition Key -> Clustering Key -> Column The deeper you need to go, more sets you need to have in memory and it will take longer to find anything. The index used to execute DISTINCT queries may even just contain sets until the partition key level, thus only allowing to search for partition keys.

You need to realise that searching any chapter that has a sub-chapter calles "My second sub-chapter" (what would be the analogy to your query) still requires 2 level deep index and 2 level iterative searchs.

If they decide to support DISTINCT use on clustering keys, then your query would be fine. Meanwhile you will have to filter them in the aplication, probably by using a built-in type called set or something similar that handles the repeated values by itself.

Nor the solution proposed by Aaron (using the userid as a clustering key after the timestamp) neither this one (filtering in the client-side) uses the DISTINCT fast mechanism. His proposal doesn't require client-side filtering as it already handles that for you but offers two main drawbacks: it doesn't offer backwards compatibility as you will have to recreate the table and uses a constant partition key and thus doesn't allow Cassandra to distribute this data among its nodes. Remember that every value of the same partition key is stored in the same node.

来源：https://stackoverflow.com/questions/26548788/select-distinct-cql-ignores-where-clause

标签

cassandra

distinct

cql