Aggregation in Cassandra across partitions

Question

I have a data model like the one below:

CREATE TABLE appstat.nodedata (
    nodeip text,
    timestamp timestamp,
    flashmode text,
    physicalusage int,
    readbw int,
    readiops int,
    totalcapacity int,
    writebw int,
    writeiops int,
    writelatency int,
    PRIMARY KEY (nodeip, timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC);

where nodeip is the partition key and timestamp is the clustering key (sorted in descending order so that the latest rows come first).

Sample data in this table:

SELECT * from nodedata WHERE nodeip = '172.30.56.60' LIMIT 2;

 nodeip       | timestamp                       | flashmode | physicalusage | readbw | readiops | totalcapacity | writebw | writeiops | writelatency
--------------+---------------------------------+-----------+---------------+--------+----------+---------------+---------+-----------+--------------
 172.30.56.60 | 2017-12-08 06:13:07.161000+0000 |       yes |            34 |     57 |       19 |            27 |       8 |        89 |           57
 172.30.56.60 | 2017-12-08 06:12:07.161000+0000 |       yes |            70 |      6 |       43 |            88 |      79 |        83 |           89

This works properly, and whenever I need the statistics I can fetch the data by partition key, like below:

SELECT nodeip,readbw,timestamp FROM nodedata WHERE nodeip = '172.30.56.60' AND timestamp < 1512652272989 AND timestamp > 1512537899000;

I can also aggregate the data successfully, like below:

SELECT sum(readbw) FROM nodedata WHERE nodeip = '172.30.56.60' AND timestamp < 1512652272989 AND timestamp > 1512537899000;

Now comes the next use case, where I need the cluster data (all the data for the four nodes), like below:

SELECT nodeip,readbw,timestamp FROM nodedata WHERE nodeip IN ('172.30.56.60','172.30.56.61','172.30.56.62','172.30.56.63') AND timestamp < 1512652272989 AND timestamp > 1512537899000;

But a number of sites clearly state that an 'IN query' has a lot of performance problems. So what do you suggest for the 'nodedata' data model shown above? (NOTE: running multiple queries against different partitions is okay, but I see it as a last resort.)

Do you have a better approach, a better redesign of this data model, or any better way to retrieve the data from multiple partitions?

Any help would be really appreciated.

Thanks,
Harry

Answer 1:

Yes, using IN on the partition key is discouraged because it puts more load on the coordinator node, especially when many partitions are specified in the IN clause. Multiple separate requests issued asynchronously, for example, could even perform better and put less load on the coordinator nodes.
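As a rough illustration of the "multiple separate async requests" idea, here is a minimal Python sketch. It assumes `session` is a DataStax Python driver (`cassandra-driver`) `Session`, whose `execute_async` returns a future with a blocking `result()` method. The function name `cluster_readbw_sum` is made up for this example, and `start` and `end` are assumed to be `datetime` values for the timestamp column.

```python
# One single-partition query per node, issued concurrently, summed client-side.
QUERY = (
    "SELECT sum(readbw) FROM appstat.nodedata "
    "WHERE nodeip = %s AND timestamp > %s AND timestamp < %s"
)


def cluster_readbw_sum(session, node_ips, start, end):
    """Return the total readbw over all given nodes in the open interval (start, end).

    `session` is assumed to be a cassandra-driver Session (hypothetical setup,
    not shown here).
    """
    # Send every request before waiting on any of them, so they run in parallel.
    futures = [session.execute_async(QUERY, (ip, start, end)) for ip in node_ips]

    total = 0
    for future in futures:
        row = future.result().one()  # blocks until this partition has answered
        # Guard against an empty result or a null aggregate.
        if row is not None and row[0] is not None:
            total += row[0]
    return total
```

Each request touches exactly one partition, so it can be routed straight to a replica for that node's data, and the only work left for the client is adding up four numbers.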

Also, you need to take into account the size of your partitions. From a quick look at the schema, I see that every partition will grow to ~55 MB in one year if you're sampling every minute. Partitions that are too wide can also lead to performance problems (although not always; it depends on the use case). Maybe you'll need to change the partition key to include the year, or year+month, to make the partitions smaller. But in that case, your code will need some additional logic when it retrieves data that spans several years/months.

P.S. Maybe this doesn't fully answer your question, but the comment field is too small for it :-)
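The "additional logic" for a year+month partition key could look something like the sketch below. It assumes a hypothetical redesigned table whose partition key is (nodeip, month), with `month` stored as a 'YYYY-MM' text value. Neither that column nor the helper names `month_buckets` and `partition_keys` come from the original schema.

```python
from datetime import datetime


def month_buckets(start, end):
    """Return the 'YYYY-MM' labels of every month touched by [start, end]."""
    if end < start:
        raise ValueError("end must not precede start")
    buckets = []
    year, month = start.year, start.month
    # Walk month by month until the month containing `end` has been added.
    while (year, month) <= (end.year, end.month):
        buckets.append(f"{year:04d}-{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return buckets


def partition_keys(node_ips, start, end):
    """List every (nodeip, month) partition a time-range query has to read."""
    months = month_buckets(start, end)
    return [(ip, m) for ip in node_ips for m in months]


if __name__ == "__main__":
    keys = partition_keys(
        ["172.30.56.60", "172.30.56.61"],
        datetime(2017, 11, 20),
        datetime(2018, 1, 5),
    )
    for key in keys:
        print(key)
```

Each (nodeip, month) pair would then get its own single-partition query, with the original timestamp bounds kept in the WHERE clause so that the first and last months are trimmed to the requested range.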

Source: https://stackoverflow.com/questions/47709008/aggregation-in-cassandra-across-partitions

Tags

cassandra

data-modeling

cqlsh

cassandra-3.0