Question:
In "Cassandra The Definitive Guide" (2nd edition) by Jeff Carpenter & Eben Hewitt, the following formula is used to calculate the size of a table on disk (apologies for the blurred part):
- ck: primary key columns
- cs: static columns
- cr: regular columns
- cc: clustering columns
- Nr: number of rows
- Nv: used to count the total size of the timestamps (I don't completely understand this part, but for now I'll ignore it)
There are two things I don't understand in this equation.
First: why does the size of the clustering columns get counted for every regular column? Shouldn't we multiply it by the number of rows instead? It seems to me that by calculating it this way, we're saying that the data in each clustering column gets replicated for each regular column, which I suppose is not the case.
Second: why don't the primary key columns get multiplied by the number of partitions? From my understanding, if we have a node with two partitions, then we should multiply the size of the primary key columns by two, because we'll have two different primary keys on that node.
Answer 1:
It's because of Cassandra's internal storage structure in versions earlier than 3.0.
- There is only one entry for each distinct partition key value.
- For each distinct partition key value there is only one entry per static column.
- For each distinct clustering key value there is one empty entry (the row marker).
- For each regular column in a row there is a single entry, and its name repeats the clustering key values.
Let's take an example :
CREATE TABLE my_table (
    pk1 int,
    pk2 int,
    ck1 int,
    ck2 int,
    d1 int,
    d2 int,
    s int static,
    PRIMARY KEY ((pk1, pk2), ck1, ck2)
);
Insert some dummy data:
 pk1 | pk2 | ck1 | ck2  | s     | d1     | d2
-----+-----+-----+------+-------+--------+---------
   1 |  10 | 100 | 1000 | 10000 | 100000 |  1000000
   1 |  10 | 100 | 1001 | 10000 | 100001 |  1000001
   2 |  20 | 200 | 2000 | 20000 | 200000 |  2000001
The internal structure will be:
      |       | 100:1000: | 100:1000:d1 | 100:1000:d2 | 100:1001: | 100:1001:d1 | 100:1001:d2 |
------+-------+-----------+-------------+-------------+-----------+-------------+-------------+
 1:10 | 10000 |           |      100000 |     1000000 |           |      100001 |     1000001 |

      |       | 200:2000: | 200:2000:d1 | 200:2000:d2 |
------+-------+-----------+-------------+-------------+
 2:20 | 20000 |           |      200000 |     2000001 |
So the size of the table will be:
Single Partition Size = (4 + 4 + 4 + 4) + 4 + 2 * ((4 + (4 + 4)) + (4 + (4 + 4))) bytes
                      = 68 bytes

Estimated Table Size  = Single Partition Size * Number of Partitions
                      = 68 * 2 bytes
                      = 136 bytes
- Here, every column is of type int (4 bytes).
- There are 4 primary key columns, 1 static column, 2 clustering key columns, and 2 regular columns.
More: http://opensourceconnections.com/blog/2013/07/24/understanding-how-cql3-maps-to-cassandras-internal-data-structure/
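For readers who want to check the arithmetic, here is a minimal Python sketch that only mirrors the calculation above (raw value sizes under the pre-3.0 layout, ignoring per-cell metadata); the variable names are illustrative, not from the book.

# Every column in the example is an int (4 bytes).
INT = 4

primary_key_cols   = [INT, INT, INT, INT]  # pk1, pk2, ck1, ck2
static_cols        = [INT]                 # s
clustering_cols    = [INT, INT]            # ck1, ck2 (repeated in every cell name)
regular_cols       = [INT, INT]            # d1, d2
rows_per_partition = 2
partitions         = 2

# Pre-3.0 layout: one cell per regular column per row, and each cell name
# carries the clustering key values.
per_row = sum(size + sum(clustering_cols) for size in regular_cols)

single_partition = sum(primary_key_cols) + sum(static_cols) + rows_per_partition * per_row
print(single_partition)               # 68
print(single_partition * partitions)  # 136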
Answer 2:
As the author, I greatly appreciate the question and your engagement with the material!
With respect to the original questions: remember that this is not the formula to calculate the size of the table; it is the formula to calculate the size of a single partition. The intent is to use this formula with a "worst case" number of rows to identify overly large partitions. You'd need to multiply the result of this equation by the number of partitions to get an estimate of the total data size for the table. And of course this does not take replication into account.
Also thanks to those who responded to the original question. Based on your feedback I spent some time looking at the new (3.0) storage format to see whether that might impact the formula. I agree that Aaron Morton's article is a helpful resource (link provided above).
The basic approach of the formula remains sound for the 3.0 storage format. The way the formula works, you're basically adding the following (see the sketch after this list):
- the sizes of the partition key and static columns
- the size of the clustering columns per row, times the number of rows
- 8 bytes of metadata for each cell
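To make that concrete, here is a minimal Python sketch of the per-partition estimate. It assumes the per-row term covers both the clustering and the regular column values, and it exposes the per-cell metadata as a parameter; the function name and signature are illustrative, not something from the book.

def estimate_partition_size(partition_key_sizes, static_sizes,
                            clustering_sizes, regular_sizes,
                            rows, metadata_per_cell=8):
    # Partition key and static column values are stored once per partition.
    fixed = sum(partition_key_sizes) + sum(static_sizes)
    # Clustering and regular column values are stored once per row.
    per_row = sum(clustering_sizes) + sum(regular_sizes)
    # One cell per regular column per row, plus one cell per static column.
    cells = rows * len(regular_sizes) + len(static_sizes)
    return fixed + rows * per_row + cells * metadata_per_cell

# Example table from the first answer, with a worst case of 2 rows per partition:
print(estimate_partition_size([4, 4], [4], [4, 4], [4, 4], rows=2))  # 84

Note that this counts only the partition key columns (pk1, pk2) in the fixed term and adds per-cell metadata, so it gives 84 bytes rather than the 68 bytes of raw values in the first answer. Lowering metadata_per_cell toward the 1-2 bytes mentioned in the next paragraph approximates the 3.0 format.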
Updating the formula for the 3.0 storage format requires revisiting the constants. For example, the original equation assumes 8 bytes of metadata per cell to store a timestamp. The new format treats the timestamp on a cell as optional since it can be applied at the row level. For this reason, there is now a variable amount of metadata per cell, which could be as low as 1-2 bytes, depending on the data type.
After reading this feedback and rereading that section of the chapter, I plan to update the text to add some clarifications as well as stronger caveats about this formula being useful as an approximation rather than an exact value. There are factors it doesn't account for at all such as writes being spread over multiple SSTables, as well as tombstones. We're actually planning another printing this spring (2017) to correct a few errata, so look for those changes soon.
Answer 3:
Here is the updated formula from Artem Chebotko:
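The formula appears as an image in the original answer and is not reproduced here; based on the symbol definitions in the question and on Chebotko's partition-size model, it presumably has the following shape (a reconstruction, not a verbatim copy):

S_t = \sum_i sizeOf(ck_i) + \sum_j sizeOf(cs_j) + N_r \times \left( \sum_k sizeOf(cr_k) + \sum_l sizeOf(cc_l) \right) + N_v \times sizeOf(t_{avg})

In Chebotko's notation, N_v is the number of values (cells) in the partition, typically given as N_v = N_r \times (N_c - N_{pk} - N_s) + N_s, where N_c is the total number of columns, N_{pk} the number of primary key columns, and N_s the number of static columns.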

The t_avg term is the average amount of metadata per cell, which can vary depending on the complexity of the data; 8 bytes is a good worst-case estimate.
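In terms of the illustrative estimate_partition_size sketch shown earlier, t_avg corresponds to the metadata_per_cell parameter:

print(estimate_partition_size([4, 4], [4], [4, 4], [4, 4], rows=2, metadata_per_cell=8))  # conservative estimate

For the 3.0 storage format, a smaller t_avg (the 1-2 bytes per cell mentioned in the previous answer) gives a tighter estimate.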