What is the byte size of common Cassandra data types, to be used when calculating partition disk usage?


I think, from a pragmatic point of view, that it is wise to get a back-of-the-envelope, worst-case estimate up-front at design time using the formulae from the DS220 course. The effect of compression varies with the algorithm and the patterns in the data. From DS220 and http://cassandra.apache.org/doc/latest/cql/types.html (a sketch of the resulting calculation follows the list):

uuid: 16 bytes
timeuuid: 16 bytes
timestamp: 8 bytes
bigint: 8 bytes
counter: 8 bytes
double: 8 bytes
time: 8 bytes
inet: 4 bytes (IPv4) or 16 bytes (IPv6)
date: 4 bytes
float: 4 bytes
int: 4 bytes
smallint: 2 bytes
tinyint: 1 byte
boolean: 1 byte (hopefully; I have no source for this)
ascii: requires an estimate of average # chars * 1 byte/char
text/varchar: requires an estimate of average # chars * (avg. # bytes/char for the language)
map/list/set/blob: requires an estimate based on the expected contents
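Putting the table above to work, here is a rough sketch of the kind of back-of-the-envelope calculation DS220 walks through. The formula (and the assumption of roughly 8 bytes of cell metadata per stored value) is my paraphrase of the course material, not an exact quote, and the column layout in the usage example is hypothetical:

```python
# Back-of-the-envelope, uncompressed partition size, roughly following the
# DS220 approach. Byte sizes come from the table above; the 8-byte per-cell
# metadata overhead and the exact formula are assumptions, not official numbers.

def estimate_partition_size(pk_sizes, static_sizes, clustering_sizes,
                            regular_sizes, rows_per_partition,
                            cell_overhead=8):
    """Worst-case size of one partition, in bytes (before compression).

    pk_sizes          -- byte size of each partition key column
    static_sizes      -- byte size of each static column (stored once)
    clustering_sizes  -- byte size of each clustering column (per row)
    regular_sizes     -- byte size of each regular column (per row)
    """
    # Stored values: one cell per regular column per row, plus one per static.
    num_values = rows_per_partition * len(regular_sizes) + len(static_sizes)

    data_bytes = (sum(pk_sizes)
                  + sum(static_sizes)
                  + rows_per_partition * (sum(clustering_sizes) + sum(regular_sizes)))

    return data_bytes + num_values * cell_overhead


# Hypothetical table: PRIMARY KEY ((sensor_id), reading_time), one static
# bigint, plus a double and a text column averaging ~20 bytes per row.
size = estimate_partition_size(
    pk_sizes=[16],              # sensor_id: uuid
    static_sizes=[8],           # a static bigint
    clustering_sizes=[8],       # reading_time: timestamp
    regular_sizes=[8, 20],      # double + ~20-byte text
    rows_per_partition=100_000)
print(f"~{size / 1024 / 1024:.1f} MiB uncompressed")
```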

Hope it helps.

The only reliable way to estimate the overhead associated with something is to actually measure it. You really can't take the individual data types and generalize from them. If you have 4 bigint columns and assume your overhead is X, with 400 bigint columns the overhead probably won't be 100X, because Cassandra compresses everything before storing data on disk (by default; it is a setting tunable per table/column family).

Try to load some data, real production data, into the cluster, then let us know your results and compression configuration. You may find some surprises.
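One quick way to follow that advice is to compare the paper estimate against what the node itself reports. A minimal sketch, assuming the DataStax Python driver, a node on localhost, and a hypothetical my_ks.my_table:

```python
# Compare the back-of-the-envelope estimate with what a node actually reports.
# Assumes a local node and the DataStax cassandra-driver; the keyspace and
# table names are placeholders.
from cassandra.cluster import Cluster

KEYSPACE, TABLE = "my_ks", "my_table"  # hypothetical names

session = Cluster(["127.0.0.1"]).connect()

# Per-token-range partition size/count estimates maintained by the node.
rows = list(session.execute(
    "SELECT mean_partition_size, partitions_count FROM system.size_estimates "
    "WHERE keyspace_name = %s AND table_name = %s", (KEYSPACE, TABLE)))
total = sum(r.partitions_count for r in rows)
if total:
    weighted = sum(r.mean_partition_size * r.partitions_count for r in rows)
    print(f"~{weighted / total / 1024:.1f} KiB mean partition size "
          f"across ~{total} partitions")

# The compression settings actually in effect for the table.
cfg = session.execute(
    "SELECT compression FROM system_schema.tables "
    "WHERE keyspace_name = %s AND table_name = %s", (KEYSPACE, TABLE)).one()
print("compression:", cfg.compression if cfg else "table not found")
```

From the command line, nodetool tablestats <keyspace>.<table> gives similar numbers, including the compression ratio and compacted partition sizes.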

Know your data.
