What are the pros or cons of storing json as text vs blob in cassandra?

Submitted on 2019-12-03 10:41:17

I don't think there's any benefit to storing the literal JSON data as a BLOB in Cassandra. At best your storage costs are identical, and in general the APIs are less convenient for working with BLOB types than they are for working with strings/text.
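To see why the storage cost is identical either way, consider that a text column is ultimately just the UTF-8 encoding of the string. A minimal Python sketch (the sample document is illustrative, not from the answer):

```python
import json

# A sample JSON document to store.
doc = json.dumps({"user": "alice", "score": 42, "tags": ["a", "b"]})

# Stored as text, Cassandra keeps the UTF-8 encoding of the string;
# stored as a blob, you would hand it the same bytes yourself.
as_text_bytes = doc.encode("utf-8")
as_blob_bytes = bytes(doc, "utf-8")

# Identical payloads, so identical storage cost.
print(as_text_bytes == as_blob_bytes)  # True
```

The only difference is which side of the API does the encoding, which is exactly the convenience argument above.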

For instance, if you're using the Java driver, then in order to store the data as a BLOB using a parameterized PreparedStatement you first need to load it all into a ByteBuffer, for example by packing your JSON data into an InputStream.

Unless you're dealing with very large JSON snippets that force you to stream your data anyway, that's a fair bit of extra work just to get access to the BLOB type. And what would you gain from it? Essentially nothing.

However, I think there's some merit in asking 'Should I store JSON as text, or gzip it and store the compressed data as a BLOB?'.

And the answer to that comes down to how you've configured Cassandra and your table. In particular, as long as you're using Cassandra version 1.1 or later your tables have compression enabled by default. That may be adequate, particularly if your JSON data is fairly uniform across each row.

However, Cassandra's built-in compression is applied table-wide, rather than to individual rows. So you may get a better compression ratio by manually compressing your JSON data before storage, writing the compressed bytes into a ByteBuffer, and then shipping the data into Cassandra as a BLOB.
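The manual approach can be sketched as follows (Python standard library; the payload and compression level are illustrative, and the compressed bytes would then be bound to a blob column via a prepared statement):

```python
import gzip
import json

# A repetitive JSON payload, typical of uniform row data.
doc = json.dumps([{"event": "click", "page": "/home", "ok": True}] * 100)
raw = doc.encode("utf-8")

# Compress once per row before writing to Cassandra.
compressed = gzip.compress(raw, compresslevel=6)

print(len(raw), len(compressed))
print(len(compressed) < len(raw))  # True for repetitive JSON like this
```

On read you'd reverse the step with `gzip.decompress` before parsing the JSON.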

So it essentially comes down to a tradeoff in terms of storage space vs. programming convenience vs. CPU usage. I would decide the matter as follows:

  1. Is minimizing the amount of storage consumed your biggest concern?
    • If yes, compress the JSON data and store the compressed bytes as a BLOB;
    • Otherwise, proceed to #2.
  2. Is Cassandra's built-in compression available and enabled for your table?
    • If no (and if you can't enable the compression), compress the JSON data and store the compressed bytes as a BLOB;
    • Otherwise, proceed to #3.
  3. Is the data you'll be storing relatively uniform across each row?
    • Probably for JSON data the answer is 'yes', in which case you should store the data as text and let Cassandra handle the compression;
    • Otherwise proceed to #4.
  4. Do you want efficiency, or convenience?
    • Efficiency: compress the JSON data and store the compressed bytes as a BLOB.
    • Convenience: compress the JSON data, base64-encode the compressed data, and then store the base64-encoded data as text.
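The "convenience" branch of step 4 can be sketched like this (Python standard library; the document is illustrative):

```python
import base64
import gzip
import json

doc = json.dumps({"id": 1, "payload": "x" * 500})

# Compress, then base64-encode so the result is plain ASCII text
# that fits an ordinary text column.
compressed = gzip.compress(doc.encode("utf-8"))
encoded = base64.b64encode(compressed).decode("ascii")

# Reading it back is the reverse: decode base64, then decompress.
restored = gzip.decompress(base64.b64decode(encoded)).decode("utf-8")
print(restored == doc)  # True
```

Note the base64 step inflates the compressed data by roughly a third, which is part of the efficiency-vs-convenience tradeoff.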

Since the data is not binary, there is really little reason to use a Binary Large OBject. Sure, you can do it, but why? Text is easier for humans to read, and there isn't really a speed or size difference.

Even in other databases you can often store JSON as text. Even MySQL, for example, has text fields that can handle quite a bit of text (LONGTEXT = 4 GB). Oracle is behind here, but hopefully it will also get a reasonably long text field at some point.

But why do you want to store a whole JSON object as text? The JSON should really be normalized and stored as multiple fields in the database.
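For instance, a nested JSON document could be flattened into per-column values before insertion. A hypothetical sketch (a real schema would be designed around your query patterns, not generated like this):

```python
import json

doc = '{"user": "alice", "address": {"city": "Oslo", "zip": "0150"}}'

def flatten(obj, prefix=""):
    """Flatten nested JSON into column-name -> value pairs."""
    cols = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            cols.update(flatten(value, prefix=name + "_"))
        else:
            cols[name] = value
    return cols

columns = flatten(json.loads(doc))
print(columns)
# {'user': 'alice', 'address_city': 'Oslo', 'address_zip': '0150'}
```

Each resulting key would then map to its own typed column, which keeps individual fields queryable.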

I would definitely say that text would be better than a blob for storing JSON. JSON is ultimately text, so this type makes sense, but there may also be extra overhead for blobs, as some drivers seem to require that they be converted to hex before inserting them. Also, blobs show up as hex-encoded strings in cqlsh, so you wouldn't be able to easily check what JSON was actually stored if you needed to for testing purposes. I'm not sure exactly how blobs are stored on disk, but I'd imagine it's very similar to how text is.

With that said, storing large entries can cause problems and is not recommended. It can cause issues with sharding and consume a lot of memory. Although the FAQ refers to files over 64 MB, from experience even files averaging a few megabytes each can cause performance issues once you start storing a large number of them. If possible, it would be better to use an object store when you expect the JSON to be megabytes in size, and store references to it in Cassandra instead.
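One way to sketch that reference pattern (Python; the in-memory dicts stand in for a real object store and a real Cassandra row, and the content-hash key scheme is a hypothetical choice):

```python
import hashlib
import json

object_store = {}   # stand-in for S3/GCS/etc.
cassandra_row = {}  # stand-in for the Cassandra row you'd actually write

def store_large_json(doc: str) -> str:
    """Put the payload in the object store; keep only a small key in Cassandra."""
    key = hashlib.sha256(doc.encode("utf-8")).hexdigest()
    object_store[key] = doc
    return key

doc = json.dumps({"big": "payload" * 1000})
cassandra_row["json_ref"] = store_large_json(doc)

# The row stays tiny; the payload is fetched from the object store on read.
print(len(cassandra_row["json_ref"]))  # 64 (hex SHA-256 digest)
print(object_store[cassandra_row["json_ref"]] == doc)  # True
```

Cassandra then only ever sees a fixed-size key, regardless of how large the JSON grows.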

The upcoming 2.2 release also adds native JSON support to Cassandra: http://www.datastax.com/dev/blog/whats-new-in-cassandra-2-2-json-support
