Cassandra - Overlapping Data Ranges

I have the following 'Tasks' table in Cassandra.

Task_ID UUID - Partition Key
Starts_On TIMESTAMP - Clustering Column
Ends_On TIMESTAMP - Clustering Column

I want to run a CQL query to get the overlapping tasks for a given date range. For example, if I pass in two timestamps (T1 and T2) as parameters to the query, I want to get the all tasks that are applicable with in that range (that is, overlapping records).

What is the best way to do this in Cassandra? I cannot just use two ranges on Starts_On and Ends_On here because to add a range query to Ends_On, I have to have a equality check for Starts_On.

Here's another idea (somewhat unconventional). You could create a user defined function to implement the second range filter (in Cassandra 2.2 and newer).

Suppose you define your table like this (shown with ints instead of timestamps to keep the example simple):

CREATE TABLE tasks (
    p int, 
    task_id timeuuid, 
    start int, 
    end int, 
    end_range int static, 
    PRIMARY KEY(p, start));

Now we create a user defined function to check returned rows based on the end time, and return the task_id of matching rows, like this:

CREATE FUNCTION my_end_range(task_id timeuuid, end int, end_range int) 
    CALLED ON NULL INPUT RETURNS timeuuid LANGUAGE java AS 
    'if (end <= end_range) return task_id; else return null;';

Now I'm using a trick there with the third parameter. In an apparent (major?) oversight, it appears you can't pass a constant to a user defined function. So to work around that, we pass a static column (end_range) as our constant.

So first we have to set the end_range we want:

UPDATE tasks SET end_range=15 where p=1;

And let's say we have this data:

SELECT * FROM tasks;

 p | start | end_range | end | task_id
---+-------+-----------+-----+--------------------------------------
 1 |     1 |        15 |   5 | 2c6e9340-4a88-11e5-a180-433e07a8bafb
 1 |     2 |        15 |   7 | 3233a040-4a88-11e5-a180-433e07a8bafb
 1 |     4 |        15 |  22 | f98fd9b0-4a88-11e5-a180-433e07a8bafb
 1 |     8 |        15 |  15 | 37ec7840-4a88-11e5-a180-433e07a8bafb

Now let's get the task_id's that have start >= 2 and end <= 15:

SELECT start, end, my_end_range(task_id, end, end_range) FROM tasks 
    WHERE p=1 AND start >= 2;

 start | end | test.my_end_range(task_id, end, end_range)
-------+-----+--------------------------------------------
     2 |   7 |       3233a040-4a88-11e5-a180-433e07a8bafb
     4 |  22 |                                       null
     8 |  15 |       37ec7840-4a88-11e5-a180-433e07a8bafb

So that gives you the matching task_id's and you have to ignore the null rows (I haven't figured out a way to drop rows using UDF's). You'll note that the filter of start >= 2 dropped one row before passing it to the UDF.

Anyway not a perfect method obviously, but it might be something you can work with. :)

In CQL you can only range query on one clustering column at a time, so you'll probably need to do some kind of client side filtering in your application. So you could range query on starts_on, and as rows are returned, check ends_on in your application and discard rows that you don't want.

A while ago I wrote an application that faced a similar problem, in querying events that had both start and end times. For our scenario, I was able to partition on a userID (as queries were for events of a specific user), set a clustering column for type of event, and also for event date. The table structure looked something like this:

CREATE TABLE userEvents (
  userid UUID,
  eventTime TIMEUUID,
  eventType TEXT,
  eventDesc TEXT,
  PRIMARY KEY ((userid),eventTime,eventType));

With this structure, I can query by userid and eventtime:

SELECT userid,dateof(eventtime),eventtype,eventdesc FROM userevents 
  WHERE userid=dd95c5a7-e98d-4f79-88de-565fab8e9a68 
  AND eventtime >= mintimeuuid('2015-08-24 00:00:00-0500');

 userid                               | system.dateof(eventtime) | eventtype | eventdesc
--------------------------------------+--------------------------+-----------+-----------
 dd95c5a7-e98d-4f79-88de-565fab8e9a68 | 2015-08-24 08:22:53-0500 |       End |    event1
 dd95c5a7-e98d-4f79-88de-565fab8e9a68 | 2015-08-24 11:45:00-0500 |     Begin |     lunch
 dd95c5a7-e98d-4f79-88de-565fab8e9a68 | 2015-08-24 12:45:00-0500 |       End |     lunch

(3 rows)

That query will give me all event rows for a particular user for today.

NOTES:

If you need to query by whether or not an event is starting or ending (I did not) you will want to order eventType ahead of eventTime in the primary key.
You will store each event twice (once for the beginning, and once for the end). Duplication of data usually isn't much of a concern in Cassandra, but I did want to explicitly point that out.
In your case, you will want to find a good key to partition on, as Task_ID will be too unique (high cardinality). This is a must in Cassandra, as you cannot range query on a partition key (only a clustering key).

来源：https://stackoverflow.com/questions/32178825/cassandra-overlapping-data-ranges

标签

cassandra

cql

cassandra-2.0

cql3