LOCAL_ONE and unexpected data replication with Cassandra


Question


FYI. We are running this test with Cassandra 2.1.12.1047 | DSE 4.8.4

We have a simple table in Cassandra that has 5,000 rows of data in it. Some time back, as a precaution, we added monitoring on each Cassandra instance to ensure that it holds 5,000 rows of data, because our replication factor should guarantee this: we have 2 replicas in every region and 6 servers in total in our dev cluster, so every node should own a full copy.

CREATE KEYSPACE example WITH replication = {'class': 'NetworkTopologyStrategy', 'ap-southeast-1-A': '2', 'eu-west-1-A': '2', 'us-east-1-A': '2'} AND durable_writes = true;

We recently forcibly terminated a server to simulate a failure and brought a new one online to see what would happen. We also removed the old node with nodetool removenode, so our expectation was that, in each region, every server would again hold all of the data.
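The replacement itself went roughly like this (a sketch of the commands rather than a transcript; the host ID is a placeholder for the terminated node's ID as reported by nodetool status):

$ nodetool status my_keyspace                      # the terminated node shows up as DN; note its Host ID
$ nodetool removenode <host-id-of-terminated-node>
$ nodetool removenode status                       # check progress while the remaining replicas stream data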

Once the new server came online, it joined the cluster and seemingly started replicating the data. We assumed that, because it was bootstrapping, it would be responsible for ensuring it gets the data it needs from the cluster. CPU finally dropped after around an hour, and we assumed the replication was complete.

However, our monitors, which intentionally run queries using LOCAL_ONE against each server, reported that all the existing servers had 5,000 rows, while the new server that was brought online was stuck at around 2,600 rows. We assumed that perhaps it was still replicating, so we left it a while, but it stayed at that number.

So we ran nodetool status to check and got the following:

$ nodetool status my_keyspace
Datacenter: ap-southeast-1-A
======================================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address         Load       Tokens  Owns (effective)  Host ID                               Rack
UN  54.255.17.28    7.9 GB     256     100.0%            a0c45f3f-8479-4046-b3c0-b2dd19f07b87  ap-southeast-1a
UN  54.255.64.1     8.2 GB     256     100.0%            b91c5863-e1e1-4cb6-b9c1-0f24a33b4baf  ap-southeast-1b
Datacenter: eu-west-1-A
=================================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address         Load       Tokens  Owns (effective)  Host ID                               Rack
UN  176.34.207.151  8.51 GB    256     100.0%            30ff8d00-1ab6-4538-9c67-a49e9ad34672  eu-west-1b
UN  54.195.174.72   8.4 GB     256     100.0%            f00dfb85-6099-40fa-9eaa-cf1dce2f0cd7  eu-west-1c
Datacenter: us-east-1-A
=================================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address         Load       Tokens  Owns (effective)  Host ID                               Rack
UN  54.225.11.249   8.17 GB    256     100.0%            0e0adf3d-4666-4aa4-ada7-4716e7c49ace  us-east-1e
UN  54.224.182.94   3.66 GB    256     100.0%            1f9c6bef-e479-49e8-a1ea-b1d0d68257c7  us-east-1d 

So if the server is reporting that it owns 100% of the data, why is the LOCAL_ONE query only giving us roughly half the data?

When I ran a LOCAL_QUORUM query it returned 5,000 rows, and from that point forward even LOCAL_ONE queries returned 5,000.

Whilst LOCAL_QUORUM solved the problem in this instance, we may in future need to run other types of queries on the assumption that each server a) has the data it should have, and b) knows how to satisfy queries when it does not have the data, i.e. it knows that the data sits somewhere else on the ring.

FURTHER UPDATE 24 hours later - PROBLEM IS A LOT WORSE

So in the absence of any feedback on this issue, I have proceeded to experiment on the cluster by adding more nodes. Following https://docs.datastax.com/en/cassandra/1.2/cassandra/operations/ops_add_node_to_cluster_t.html, I carried out all the recommended steps for adding nodes to the cluster and, in effect, adding capacity. My understanding of Cassandra's premise is that as you add nodes, it is the cluster's responsibility to rebalance the data, and during that time to fetch the data from wherever on the ring it currently sits if it is not where it should be.
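For reference, a sketch of the cassandra.yaml settings I used on each new node before starting it (the seed IPs are placeholders and the snitch is only my assumption given our custom DC names; the key point is that auto_bootstrap was left at its default of true):

# cassandra.yaml fragments on each new node (illustrative values only)
cluster_name: 'Ably'
num_tokens: 256
auto_bootstrap: true          # default; a joining node streams its token ranges on first start
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "54.255.xxx.xxx,54.78.xxx.xxx,54.225.xxx.xxx"   # one or two existing nodes per DC
endpoint_snitch: GossipingPropertyFileSnitch   # assumption; DC/rack names come from cassandra-rackdc.properties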

Unfortunately that rebalancing does not appear to be happening at all. Here is my new ring:

Datacenter: ap-southeast-1-A
======================================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address         Load       Tokens  Owns (effective)  Host ID                               Rack
UN  54.255.xxx.xxx  8.06 GB    256     50.8%             a0c45f3f-8479-4046-b3c0-b2dd19f07b87  ap-southeast-1a
UN  54.254.xxx.xxx  2.04 MB    256     49.2%             e2e2fa97-80a0-4768-a2aa-2b63e2ab1577  ap-southeast-1a
UN  54.169.xxx.xxx  1.88 MB    256     47.4%             bcfc2ff0-67ab-4e6e-9b18-77b87f6b3df3  ap-southeast-1b
UN  54.255.xxx.xxx  8.29 GB    256     52.6%             b91c5863-e1e1-4cb6-b9c1-0f24a33b4baf  ap-southeast-1b
Datacenter: eu-west-1-A
=================================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address         Load       Tokens  Owns (effective)  Host ID                               Rack
UN  54.78.xxx.xxx   8.3 GB     256     49.9%             30ff8d00-1ab6-4538-9c67-a49e9ad34672  eu-west-1b
UN  54.195.xxx.xxx  8.54 GB    256     50.7%             f00dfb85-6099-40fa-9eaa-cf1dce2f0cd7  eu-west-1c
UN  54.194.xxx.xxx  5.3 MB     256     49.3%             3789e2cc-032d-4b26-bff9-b2ee71ee41a0  eu-west-1c
UN  54.229.xxx.xxx  5.2 MB     256     50.1%             34811c15-de8f-4b12-98e7-0b4721e7ddfa  eu-west-1b
Datacenter: us-east-1-A
=================================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address         Load       Tokens  Owns (effective)  Host ID                               Rack
UN  54.152.xxx.xxx  5.27 MB    256     47.4%             a562226a-c9f2-474f-9b86-46c3d2d3b212  us-east-1d
UN  54.225.xxx.xxx  8.32 GB    256     50.3%             0e0adf3d-4666-4aa4-ada7-4716e7c49ace  us-east-1e
UN  52.91.xxx.xxx   5.28 MB    256     49.7%             524320ba-b8be-494a-a9ce-c44c90555c51  us-east-1e
UN  54.224.xxx.xxx  3.85 GB    256     52.6%             1f9c6bef-e479-49e8-a1ea-b1d0d68257c7  us-east-1d

As you will see, I have doubled the size of the ring and the effective ownership is roughly 50% per server, as expected (my replication factor is 2 copies in every region). However, worryingly, you can see that some servers have absolutely no load on them (they are new), whilst others have excessive load on them (they are old and clearly no redistribution of data has occurred).

Now this in itself is not what worries me, as I believe in Cassandra's ability to eventually get the data into the right place. The thing that worries me immensely is that my table with exactly 5,000 rows no longer has 5,000 rows in any of my three regions.

# From ap-southeast-1

cqlsh> CONSISTENCY ONE;
Consistency level set to ONE.

cqlsh> select count(*) from health_check_data_consistency;

 count
-------
  3891

cqlsh> CONSISTENCY LOCAL_QUORUM;
Consistency level set to LOCAL_QUORUM.

cqlsh> select count(*) from health_check_data_consistency;

 count
-------
  4633


# From eu-west-1

cqlsh> CONSISTENCY ONE;
Consistency level set to ONE.

cqlsh> select count(*) from health_check_data_consistency;

 count
-------
  1975

cqlsh> CONSISTENCY LOCAL_QUORUM;
Consistency level set to LOCAL_QUORUM.

cqlsh> select count(*) from health_check_data_consistency;

 count
-------
  4209


# From us-east-1

cqlsh> CONSISTENCY ONE;
Consistency level set to ONE.

cqlsh> select count(*) from health_check_data_consistency;

 count
-------
  4435

cqlsh> CONSISTENCY LOCAL_QUORUM;
Consistency level set to LOCAL_QUORUM.

cqlsh> select count(*) from health_check_data_consistency;

 count
-------
  4870

So seriously, what is going on here? Let's recap:

  • my replication factor is 'ap-southeast-1-A': '2', 'eu-west-1-A': '2', 'us-east-1-A': '2' so every region should be able to satisfy a query in full.
  • Bringing on new instances should not cause data loss, yet apparently it does, even with LOCAL_QUORUM.
  • Every region has a different view of the data, yet I have not introduced any new data, only new servers that then bootstrap automatically.

So then I thought: why not do a QUORUM query across the entire multi-region cluster? Unfortunately that fails completely:

cqlsh> CONSISTENCY QUORUM;
Consistency level set to QUORUM.

cqlsh> select count(*) from health_check_data_consistency;
OperationTimedOut: errors={}, last_host=172.17.0.2

I then turned TRACING ON; and that failed too. All I can see in the logs is the following:

INFO  [SlabPoolCleaner] 2016-03-03 19:16:16,616  ColumnFamilyStore.java:1197 - Flushing largest CFS(Keyspace='system_traces', ColumnFamily='events') to free up room. Used total: 0.33/0.00, live: 0.33/0.00, flushing: 0.00/0.00, this: 0.02/0.02
INFO  [SlabPoolCleaner] 2016-03-03 19:16:16,617  ColumnFamilyStore.java:905 - Enqueuing flush of events: 5624218 (2%) on-heap, 0 (0%) off-heap
INFO  [MemtableFlushWriter:1126] 2016-03-03 19:16:16,617  Memtable.java:347 - Writing Memtable-events@732346653(1.102MiB serialized bytes, 25630 ops, 2%/0% of on/off-heap limit)
INFO  [MemtableFlushWriter:1126] 2016-03-03 19:16:16,821  Memtable.java:382 - Completed flushing /var/lib/cassandra/data/system_traces/events/system_traces-events-tmp-ka-3-Data.db (298.327KiB) for commitlog position ReplayPosition(segmentId=1456854950580, position=28100666)
INFO  [ScheduledTasks:1] 2016-03-03 19:16:21,210  MessagingService.java:929 - _TRACE messages were dropped in last 5000 ms: 212 for internal timeout and 0 for cross node timeout

This happens on every single server I run the query on.

Checking the cluster, it seems everything is in sync:

$ nodetool describecluster;
Cluster Information:
    Name: Ably
    Snitch: org.apache.cassandra.locator.DynamicEndpointSnitch
    Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
    Schema versions:
            51e57d47-8870-31ca-a2cd-3d854e449687: [54.78.xxx.xxx, 54.152.xxx.xxx, 54.254.xxx.xxx, 54.255.xxx.xxx, 54.195.xxx.xxx, 54.194.xxx.xxx, 54.225.xxx.xxx, 52.91.xxx.xxx, 54.229.xxx.xxx, 54.169.xxx.xxx, 54.224.xxx.xxx, 54.255.xxx.xxx]

FURTHER UPDATE 1 hour later

Someone suggested that perhaps this was simply down to range queries not working as expected. I thus wrote a simple script that queries each of the 5,000 rows individually (they have IDs in the range 1 to 5,000). Unfortunately the results are as I feared: I have missing data. I have tried this with LOCAL_ONE, LOCAL_QUORUM and even QUORUM.

ruby> (1..5000).each { |id| puts "#{id} missing" if session.execute("select id from health_check_data_consistency where id = #{id}", consistency: :local_quorum).length == 0 }
19 missing, 61 missing, 84 missing, 153 missing, 157 missing, 178 missing, 248 missing, 258 missing, 323 missing, 354 missing, 385 missing, 516 missing, 538 missing, 676 missing, 708 missing, 727 missing, 731 missing, 761 missing, 863 missing, 956 missing, 1006 missing, 1102 missing, 1121 missing, 1161 missing, 1369 missing, 1407 missing, 1412 missing, 1500 missing, 1529 missing, 1597 missing, 1861 missing, 1907 missing, 2005 missing, 2168 missing, 2207 missing, 2210 missing, 2275 missing, 2281 missing, 2379 missing, 2410 missing, 2469 missing, 2672 missing, 2726 missing, 2757 missing, 2815 missing, 2877 missing, 2967 missing, 3049 missing, 3070 missing, 3123 missing, 3161 missing, 3235 missing, 3343 missing, 3529 missing, 3533 missing, 3830 missing, 4016 missing, 4030 missing, 4084 missing, 4118 missing, 4217 missing, 4225 missing, 4260 missing, 4292 missing, 4313 missing, 4337 missing, 4399 missing, 4596 missing, 4632 missing, 4709 missing, 4786 missing, 4886 missing, 4934 missing, 4938 missing, 4942 missing, 5000 missing

As you can see from the above, roughly 1.5% of my data is no longer available.

So I am stumped. I really need some advice here, because I was certainly under the impression that Cassandra was specifically designed to handle scaling out horizontally on demand. Any help is greatly appreciated.


Answer 1:


Regarding ownership: this is based on token ownership, not actual data, so the reported ownership in each case looks correct regardless of the data volume on each node.

Second, you can’t guarantee consistency with two nodes (unless you sacrifice availability and use CL=ALL). QUORUM means a majority of replicas, and with RF=2 that is 2 of 2, which is effectively ALL. You need at least three nodes per DC to truly get a quorum. If consistency is important to you, deploy three nodes per DC and do QUORUM reads and writes.
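As a rough sketch with the keyspace from the question (this assumes a third node exists in each DC, and after raising the replication factor you would need to run nodetool repair on every node to populate the new replicas):

cqlsh> ALTER KEYSPACE example WITH replication = {'class': 'NetworkTopologyStrategy', 'ap-southeast-1-A': '3', 'eu-west-1-A': '3', 'us-east-1-A': '3'} AND durable_writes = true;
cqlsh> CONSISTENCY LOCAL_QUORUM;
cqlsh> select count(*) from health_check_data_consistency;

With RF=3, LOCAL_QUORUM needs 2 of the 3 replicas in the local DC, so a single node can be down without blocking reads or writes.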

SELECT count(*) across DCs is going to time out. There’s probably several hundred milliseconds of latency between your us and ap datacenters, and select count(*) is an expensive operation on top of that.

When you do a QUORUM read, Cassandra is going to fix inconsistent data with a read repair. That’s why your counts are accurate after you run the query at quorum.

All that being said, you do seem to have a bootstrap problem, because new nodes aren’t getting all of the data. First I’d run a repair on all the nodes and make sure they all have 5,000 records after doing so; that will tell you whether streaming is broken. Then repeat the node replacement like you did before, but this time monitor with nodetool netstats and watch the logs, and post anything strange. And don’t forget you have to run nodetool cleanup to remove data from the old nodes.
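A minimal sketch of that sequence, assuming the keyspace name from your nodetool status output:

$ nodetool repair my_keyspace      # run on every node, one at a time, then re-check the row counts
$ nodetool netstats                # on the joining node: shows active and pending streams during bootstrap
$ nodetool cleanup my_keyspace     # afterwards, on the pre-existing nodes: drops data for ranges they no longer own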

Can you describe your hardware config (RAM, CPU, disk, etc.)?




Answer 2:


What I should have said is that you can’t guarantee consistency AND availability, since with RF=2 your quorum query is essentially an ALL query. The only way to query while one of the nodes is down would be to lower the CL, and that won’t do a read repair if the data on the available node is inconsistent.

After running repair you also need to run cleanup on the old nodes to remove the data they no longer own. Also, repair won’t remove deleted/TTL’d data until after the gc_grace_seconds period, so if you have any of that, it’ll stick around for at least gc_grace_seconds.
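If you want to see what gc_grace_seconds is set to, one quick way (a sketch using the keyspace and table names from the question; the default is 864000 seconds, i.e. 10 days) is to describe the table in cqlsh and look at its properties:

cqlsh> DESCRIBE TABLE example.health_check_data_consistency;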

Did you find anything in the logs? Can you share your configuration?



Source: https://stackoverflow.com/questions/35752291/local-one-and-unexpected-data-replication-with-cassandra
