Cassandra Bulk-Write performance with Java Driver is atrocious compared to MongoDB

Asked by 灰色年华 on 2021-01-07 02:50

I have built an importer for MongoDB and Cassandra. Basically all operations of the importer are the same, except for the last part where data gets formed to match the needed …

2 Answers
  •  粉色の甜心 · 2021-01-07 03:33

    After using C* for a bit, I'm convinced you should really use batches only for keeping multiple tables in sync. If you don't need that feature, then don't use batches at all, because you will incur performance penalties.
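
    For illustration, here is a minimal sketch (Java driver 3.x) of that one legitimate use case: a LOGGED batch keeping two denormalized tables in sync. The table and column names are assumptions, not from the question.

    // Minimal sketch, driver 3.x: a LOGGED batch keeps two denormalized
    // tables in sync. Table/column names are assumptions for illustration.
    PreparedStatement insertById =
            session.prepare("INSERT INTO users_by_id (id, email) VALUES (?, ?)");
    PreparedStatement insertByEmail =
            session.prepare("INSERT INTO users_by_email (email, id) VALUES (?, ?)");

    BatchStatement batch = new BatchStatement(BatchStatement.Type.LOGGED);
    batch.add(insertById.bind(userId, email));
    batch.add(insertByEmail.bind(email, userId));
    session.execute(batch); // both tables get the write, at a latency cost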

    The correct way to load data into C* is with async writes, with optional backpressure if your cluster can't keep up with the ingestion rate. You should replace your "custom" batching method with something that:

    • performs async writes
    • keeps the number of in-flight writes under control
    • retries when a write times out

    To perform async writes, use the .executeAsync method, which returns a ResultSetFuture object.

    To keep the number of in-flight queries under control, collect the ResultSetFuture objects returned by .executeAsync in a list, and once the list reaches (ballpark value here) say 1k elements, wait for all of them to finish before issuing more writes. Alternatively, wait for the first one to finish before issuing the next write, just to keep the list full.
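
    A minimal sketch of that windowed approach, assuming a driver 3.x Session and an iterable of bound statements (both names are assumptions):

    // Sketch: issue async writes, draining the window once it reaches
    // MAX_IN_FLIGHT outstanding futures. The constant is a ballpark value.
    final int MAX_IN_FLIGHT = 1000;
    List<ResultSetFuture> inFlight = new ArrayList<>();

    for (BoundStatement bound : statements) {
        inFlight.add(session.executeAsync(bound));
        if (inFlight.size() >= MAX_IN_FLIGHT) {
            for (ResultSetFuture f : inFlight) {
                f.getUninterruptibly(); // blocks; throws if the write failed
            }
            inFlight.clear(); // window drained, start filling it again
        }
    }
    for (ResultSetFuture f : inFlight) {
        f.getUninterruptibly(); // wait for the tail of the last window
    }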

    And finally, you can check for write failures when you're waiting on an operation to complete. In that case, you could:

    1. write again with the same timeout value
    2. write again with an increased timeout value
    3. wait some amount of time, and then write again with the same timeout value
    4. wait some amount of time, and then write again with an increased timeout value

    Options 1 through 4 apply increasingly strong backpressure. Pick the one that best fits your case.
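
    As an illustration, a sketch of option 4 (the strongest of the four). The retry cap, the backoff constants, and the method name are assumptions; setReadTimeoutMillis is the driver 3.x per-statement client timeout.

    // Sketch of strategy 4: wait some amount of time, then retry the
    // write with an increased timeout value. All constants are assumptions.
    void writeWithBackoff(Session session, Statement statement)
            throws InterruptedException {
        int timeoutMillis = 5000;
        for (int attempt = 1; attempt <= 5; attempt++) {
            try {
                statement.setReadTimeoutMillis(timeoutMillis);
                session.execute(statement);
                return; // success
            } catch (OperationTimedOutException | WriteTimeoutException e) {
                Thread.sleep(500L * attempt); // wait: backpressure
                timeoutMillis *= 2;           // increased timeout on retry
            }
        }
        throw new RuntimeException("write failed after 5 attempts");
    }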


    EDIT after question update

    Your insert logic seems a bit broken to me:

    1. I don't see any retry logic
    2. You don't remove the item from the list if it fails
    3. Your while (concurrentInsertErrorOccured && runningInsertList.size() > concurrentInsertLimit) is wrong: you will sleep only when the number of issued queries is greater than concurrentInsertLimit, and because of 2. the list never shrinks, so your thread will just park there
    4. You never reset concurrentInsertErrorOccured to false

    I usually keep a list of failed queries so I can retry them at a later time. That gives me direct control over the queries, and when failed queries start to accumulate I sleep for a few moments, then keep retrying them (up to X times, then hard fail...).

    This list should be very dynamic: you add items to it when queries fail, and remove items when you perform a retry. That way you learn the limits of your cluster, and can tune your concurrentInsertLimit based on, e.g., the average number of failed queries in the last second, or stick with the simpler approach "pause if we have an item in the retry list", etc...
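
    A rough sketch of that retry-list bookkeeping; the queue type, the retry cap, and the method names are assumptions for illustration:

    // Sketch: failed statements go into a retry queue; a drain pass
    // re-issues them, pausing between retries, and hard-fails after a cap.
    private final Queue<Statement> retryQueue = new ConcurrentLinkedQueue<>();
    private final Map<Statement, Integer> attempts = new ConcurrentHashMap<>();
    private static final int MAX_ATTEMPTS = 5;

    void onWriteFailure(Statement stmt) {
        retryQueue.add(stmt); // remember the query for a later retry
    }

    void drainRetries(Session session) throws InterruptedException {
        Statement stmt;
        while ((stmt = retryQueue.poll()) != null) {
            int n = attempts.merge(stmt, 1, Integer::sum);
            if (n > MAX_ATTEMPTS) {
                throw new RuntimeException("query failed " + n + " times, giving up");
            }
            Thread.sleep(200); // sleep for a few moments before retrying
            try {
                session.execute(stmt);
                attempts.remove(stmt); // success: drop the bookkeeping
            } catch (QueryExecutionException e) {
                retryQueue.add(stmt);  // failed again: keep it in the list
            }
        }
    }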


    EDIT 2 after comments

    Since you don't want any retry logic, I would change your code this way:

    private List<ResultSetFuture> runningInsertList;
    private volatile boolean concurrentInsertErrorOccured = false;
    private static int concurrentInsertLimit = 1000;
    private static int concurrentInsertSleepTime = 500;
    ...
    
    @Override
    public void executeBatch(Statement statement) throws InterruptedException {
        if (this.runningInsertList == null) {
            // Callbacks run on driver threads, so the list must be thread-safe
            this.runningInsertList = Collections.synchronizedList(new ArrayList<>());
        }
    
        ResultSetFuture future = this.executeAsync(statement);
        this.runningInsertList.add(future);
    
        Futures.addCallback(future, new FutureCallback<ResultSet>() {
            @Override
            public void onSuccess(ResultSet result) {
                runningInsertList.remove(future);
            }
    
            @Override
            public void onFailure(Throwable t) {
                runningInsertList.remove(future);
                concurrentInsertErrorOccured = true;
            }
        }, MoreExecutors.sameThreadExecutor());
    
        // Sleep while the number of in-flight inserts is too high
        while (runningInsertList.size() >= concurrentInsertLimit) {
            Thread.sleep(concurrentInsertSleepTime);
        }
    
        if (!concurrentInsertErrorOccured) {
            // No query failed so far: increase the ingestion rate
            concurrentInsertLimit += 10;
        } else {
            // At least one query failed: reset the flag and back off
            concurrentInsertErrorOccured = false;
            concurrentInsertLimit = Math.max(1, concurrentInsertLimit - 50);
            while (runningInsertList.size() >= concurrentInsertLimit) {
                Thread.sleep(concurrentInsertSleepTime);
            }
        }
    }
    

    You could also optimize the procedure a bit by replacing the List with a counter.
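
    For example, a sketch of that counter variant using an AtomicInteger (a java.util.concurrent.Semaphore would work just as well):

    // Sketch: same throttling as above, but with an atomic counter
    // instead of a synchronized list of futures.
    private final AtomicInteger inFlight = new AtomicInteger(0);

    public void executeBatch(Statement statement) throws InterruptedException {
        inFlight.incrementAndGet();
        ResultSetFuture future = this.executeAsync(statement);
        Futures.addCallback(future, new FutureCallback<ResultSet>() {
            @Override
            public void onSuccess(ResultSet result) {
                inFlight.decrementAndGet();
            }

            @Override
            public void onFailure(Throwable t) {
                inFlight.decrementAndGet();
                concurrentInsertErrorOccured = true;
            }
        }, MoreExecutors.sameThreadExecutor());

        while (inFlight.get() >= concurrentInsertLimit) {
            Thread.sleep(concurrentInsertSleepTime);
        }
    }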

    Hope that helps.
