How to batch select data from Cassandra effectively?

谎友^ 2020-12-17 07:37

I know Cassandra doesn't support batch queries, and using IN is also not recommended because it can degrade performance. But I have to get the data by id, f…

2 Answers
  •  独厮守ぢ
    2020-12-17 08:08

    My preferred way to issue this kind of query is to unroll the IN part. That simply means issuing multiple single-key queries in parallel: the token-aware driver treats each one as an independent query and spreads them across the nodes, so each node acts as coordinator only for the queries routed to it.
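
    For example, instead of a single SELECT ... WHERE id IN (...), each id gets its own single-partition query against a prepared statement. A minimal sketch, assuming a hypothetical table my_keyspace.my_table with partition key id (the code below binds three values, so adjust the bind markers to your actual key columns):

    // Hypothetical table and column names, for illustration only
    PreparedStatement myBeautifulPreparedStatement = session.prepare(
            "SELECT * FROM my_keyspace.my_table WHERE id = ?");
    // A bound statement hits exactly one partition, so the token-aware
    // driver can route it directly to a replica of that partition
    BoundStatement bs = myBeautifulPreparedStatement.bind(someId);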

    You should keep at most X queries in flight at any time and, once that limit is reached, wait until at least one of them finishes (example in Java):

    final int X = 1000;  // maximum number of in-flight queries
    List<ResultSetFuture> futures = new ArrayList<>();
    List<ResultSet> results = new ArrayList<>();
    for (int i = 0; i < allTheRowsINeedToFetch; i++) {
        futures.add(session.executeAsync(myBeautifulPreparedStatement.bind(xxx, yyy, zzz)));
        // Collect completed queries; block only once the in-flight limit X is reached
        while (futures.size() >= X || (futures.size() > 0 && futures.get(0).isDone())) {
            ResultSetFuture rsf = futures.remove(0);
            results.add(rsf.getUninterruptibly());
        }
    }

    // Wait for whatever is still in flight
    while (futures.size() > 0) {
        ResultSetFuture rsf = futures.remove(0);
        results.add(rsf.getUninterruptibly());
    }

    // Now use the results
    
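    Once both loops have drained, the collected ResultSets can be consumed like any synchronous result. A minimal sketch, assuming each id maps to at most one row and a hypothetical column named id:

    for (ResultSet rs : results) {
        Row row = rs.one();  // at most one row per single-partition query here
        if (row != null) {
            System.out.println(row.getObject("id"));  // "id" is a hypothetical column
        }
    }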

    Capping the number of in-flight queries like this is known as backpressure, and it is used to move the pressure from the cluster to the client.

    The nice thing about this method is that you can go truly parallel (X = allTheRowsINeedToFetch) or truly serial (X = 1), and everything in between depends only on your cluster hardware. Low values of X mean you are not using your cluster's capacity, while high values mean you are asking for trouble, because you will start to see timeouts. So you really need to tune it.
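
    The same in-flight cap can also be expressed with a plain java.util.concurrent.Semaphore plus a completion listener, instead of polling futures.get(0).isDone(). This is an alternative sketch, not part of the original answer; MoreExecutors comes from Guava:

    final Semaphore inFlight = new Semaphore(X);
    List<ResultSetFuture> futures = new ArrayList<>();
    for (int i = 0; i < allTheRowsINeedToFetch; i++) {
        inFlight.acquireUninterruptibly();  // blocks once X queries are pending
        ResultSetFuture f = session.executeAsync(myBeautifulPreparedStatement.bind(xxx, yyy, zzz));
        // Release the permit when this query completes (success or failure)
        f.addListener(inFlight::release, MoreExecutors.directExecutor());
        futures.add(f);
    }
    // Then collect the results from the futures exactly as above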
