Does Cassandra read the whole row when limiting the number of requested results?

Question

I am using Cassandra 2.0.6 and have this table:

CREATE TABLE t (
    id text,
    idx bigint,
    data bigint,
    PRIMARY KEY (id, idx)
)

Say I have these rows:

id / idx / data
x    1     data1
x    2     data2
x    3     data3

... and so on, for say 1000 rows with id 'x'.

If I run this query:

select * from t where id='x' order by idx limit 1

Will Cassandra fetch all 1000 rows, or only a small part of them?

Articles like http://www.ebaytechblog.com/2012/08/14/cassandra-data-modeling-best-practices-part-2/#.UzrvLKZx2PI suggest that it will fetch only a small part. But in my stress tests, the more data I have in the table, the more disk I/O (MB/sec) I see.

For 8 GB of data I was getting 3 MB/sec of read I/O.
For 12 GB of data I was getting 15 MB/sec of read I/O.
For 20 GB of data I am currently getting 35 MB/sec of read I/O.

I don't see anything weird in cfhistograms:

SSTables per Read
1 sstables: 421010
2 sstables: 552
3 sstables: 9
4 sstables: 0
5 sstables: 254
6 sstables: 3221
7 sstables: 3063
8 sstables: 1029
10 sstables: 143

Read Latency (microseconds)
12 us: 6
14 us: 36
17 us: 471
20 us: 2795
24 us: 10799
29 us: 18594
35 us: 24693
42 us: 43078
50 us: 67438
60 us: 68872
72 us: 70718
86 us: 47300
103 us: 23471
124 us: 11752
149 us: 4509
179 us: 1437
215 us: 832
258 us: 3444
310 us: 7883
372 us: 2374
446 us: 736
535 us: 624
642 us: 581
770 us: 1875
924 us: 1715
1109 us: 2889
1331 us: 3705
1597 us: 2197
1916 us: 1320
2299 us: 826
2759 us: 639
3311 us: 431
3973 us: 312
4768 us: 213
5722 us: 106
6866 us: 72
8239 us: 44
9887 us: 36
11864 us: 25
14237 us: 16
17084 us: 23
20501 us: 20
24601 us: 15
29521 us: 28
35425 us: 21
42510 us: 20
51012 us: 49
61214 us: 49
73457 us: 29
88148 us: 23
105778 us: 35
126934 us: 23
152321 us: 17
182785 us: 13
219342 us: 10
263210 us: 8
315852 us: 3
379022 us: 8
454826 us: 10

Answer 1:

You get more I/O because you are ordering and limiting on the fly. If you are sure about the order in which you want to fetch the data, set the clustering order on the column family when you create it:

CREATE TABLE tablename (.......) WITH CLUSTERING ORDER BY (idx DESC)

This way, the rows within each partition are stored sorted by idx in descending order by default. Hence, when you apply a limit, you reduce the disk I/O.

Answer 2:

Once you have set the clustering order, you no longer pay for ordering at read time. If you still have problems with large amounts of data, it will likely be due to the compaction strategy. I suspect you are using the size-tiered compaction strategy on a read-heavy column family. Try the same scenario with the leveled compaction strategy.

With size-tiered compaction, your data is spread across multiple SSTables, and each read is bound to pull data from several of them. So a read-heavy column family does not do well with it.
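One way to sanity-check this against the numbers in the question is to compute what share of reads touched more than one SSTable. The following is a minimal Python sketch of that arithmetic; the counts are copied from the "SSTables per Read" histogram in the question, and the variable names are only illustrative.

```python
# Counts copied from the "SSTables per Read" histogram in the question:
# key = number of SSTables touched by a read, value = number of reads.
sstables_per_read = {
    1: 421010, 2: 552, 3: 9, 4: 0, 5: 254,
    6: 3221, 7: 3063, 8: 1029, 10: 143,
}

total_reads = sum(sstables_per_read.values())
multi_sstable_reads = sum(
    count for sstables, count in sstables_per_read.items() if sstables > 1
)
share = multi_sstable_reads / total_reads

print(
    f"{multi_sstable_reads} of {total_reads} reads "
    f"({share:.1%}) touched more than one SSTable"
)
```

In this sample, 8271 of 429281 reads (about 1.9%) touched more than one SSTable, so the spread across SSTables affects a minority of the reads shown.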

Answer 3:

I found out that I was accidentally exhausting the result set iterator. I fixed that, and now I/O is normal.
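To illustrate why draining the iterator matters, here is a toy Python model of a result set that is fetched lazily, one page at a time. It is only a sketch: `PagedResult`, `pages_fetched`, the row count, and the page size are hypothetical names and values made up for this example, not the API of any real Cassandra driver, and it assumes the query can return more rows than the caller needs.

```python
from typing import Iterator, List


class PagedResult:
    """Toy stand-in for a result set that fetches rows page by page.

    Hypothetical class for illustration only; not a real driver API.
    """

    def __init__(self, total_rows: int, page_size: int) -> None:
        self.total_rows = total_rows
        self.page_size = page_size
        self.pages_fetched = 0  # counts simulated round trips to the server

    def _fetch_page(self, start: int) -> List[int]:
        # Simulate one round trip that returns up to page_size rows.
        self.pages_fetched += 1
        end = min(start + self.page_size, self.total_rows)
        return list(range(start, end))

    def __iter__(self) -> Iterator[int]:
        start = 0
        while start < self.total_rows:
            page = self._fetch_page(start)
            yield from page
            start += len(page)


# Taking only the first row pulls a single page.
first_only = PagedResult(total_rows=1000, page_size=100)
first_row = next(iter(first_only))

# Draining the iterator (for example by copying it into a list) pulls every page.
drained = PagedResult(total_rows=1000, page_size=100)
all_rows = list(drained)

print(first_only.pages_fetched, drained.pages_fetched)
```

In this model, reading just the first row costs one page fetch, while copying the whole iterator into a list costs ten. Client code that loops over or copies the full result set when it needs only one row pays for every page.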

Source: https://stackoverflow.com/questions/22792260/does-cassandra-read-the-whole-row-when-limiting-the-number-of-requested-results

Tags

cassandra

column-family

super-columns