I\'ve read quite a few articles and a lot of question/answers on SO about Cassandra but I still can\'t figure out how Cassandra decides which node(s) to go to when it\'s rea
Client sends the data to a random node
It might seem that way, but there is actually a non-random way that your driver picks a node to talk to. This node is called a "coordinator node" and is typically chosen based-on having the least (closest) "network distance." Client requests can really be sent to any node, and at first they will be sent to the nodes which your driver knows about. But once it connects and understands the topology of your cluster, it may change to a "closer" coordinator.
The nodes in your cluster exchange topology information with each other using the Gossip Protocol. The gossiper runs every second, and ensures that all nodes are kept current with data from whichever Snitch you have configured. The snitch keeps track of which data centers and racks each node belongs to.
In this way, the coordinator node also has data about which nodes are responsible for each token range. You can see this information by running a nodetool ring
from the command line. Although if you are using vnodes, that will be trickier to ascertain, as data on all 256 (default) virtual nodes will quickly flash by on the screen.
So let's say that I have a table that I'm using to keep track of ship crew members by their first name, and let's assume that I want to look-up Malcolm Reynolds. Running this query:
SELECT token(firstname),firstname, id, lastname
FROM usersbyfirstname WHERE firstname='Mal';
...returns this row:
token(firstname) | firstname | id | lastname
----------------------+-----------+----+-----------
4016264465811926804 | Mal | 2 | Reynolds
By running a nodetool ring
I can see which node is responsible for this token:
192.168.1.22 rack1 Up Normal 348.31 KB 3976595151390728557
192.168.1.22 rack1 Up Normal 348.31 KB 4142666302960897745
Or even easier, I can use nodetool getendpoints
to see this data:
$ nodetool getendpoints stackoverflow usersbyfirstname Mal
Picked up JAVA_TOOL_OPTIONS: -javaagent:/usr/share/java/jayatanaag.jar
192.168.1.22
For more information, check out some of the items linked above, or try running nodetool gossipinfo
.
Cassandra will locate any data based on a partition key that is mapped to a token value by the partitioner. Tokens are part of a finite token ring value range where each part of the ring is owned by a node in the cluster. The node owning the range of a certain token is said to be the primary for that token. Replicas will be selected by the data replication strategy. Basically this works by going clockwise in the token ring, starting from the primary, and stopping depending on the number of required replicas.
What's important to realize is that each node in the cluster is able to identify the nodes responsible for a certain key based on the logic described above. Whenever a value is written to the cluster, the node accepting the request (the coordinator node) will know right away the nodes that need to execute the write.
In case of multiple data-centers, all keys will be mapped across all DCs to the exact same token determined by the partitioner. Cassandra will try to write to each DC and each DC's replicas.
Cassandra uses consistent hashing to map each partition key to a token value. Each node owns ranges of token values as its primary range, so that every possible hash value will map to one node. Extra replicas are then kept in a systematic way (such as the next node in the ring) and stored in the nodes as their secondary range.
Every node in the cluster knows the topology of the entire cluster, such as which nodes are in which data center, where they are in the ring, and which token ranges each nodes owns. The nodes get and maintain this information using the gossip protocol.
When a read request comes in, the node contacted becomes the coordinator for the read. It will calculate which nodes have replicas for the requested partition, and then pick the required number of nodes to meet the consistency level. It will then send requests to those nodes and wait for their responses and merge the results based on the column timestamps before sending the result back to the client.