After reading this old article measuring the memory consumption of several object types, I was amazed to see how much memory Strings use in Java:
I'm currently implementing a compression method as follows (I'm working on an app that needs to store a very large number of documents in memory so we can do document-to-document computation):
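For concreteness, here is a minimal sketch of the packing step described below (fitting characters into a long with masking and bit shifting). It covers only the 4-characters-per-long case. The class and method names (WordPacker, pack, unpack) are hypothetical and not taken from any library, and the sketch assumes the input never contains (char) 0, since that value is used as padding.

```java
// Hypothetical helper, for illustration only.
final class WordPacker {

    // Packs 4 chars (16 bits each) into each long. The first char of a word
    // goes into the high 16 bits; the last word is implicitly padded with
    // (char) 0 because the array starts out zeroed.
    static long[] pack(String s) {
        int wordCount = (s.length() + 3) / 4; // round up to whole words
        long[] words = new long[wordCount];
        for (int i = 0; i < s.length(); i++) {
            int shift = 48 - 16 * (i % 4);
            words[i / 4] |= ((long) s.charAt(i)) << shift;
        }
        return words;
    }

    // Reverses pack(). Assumes the original string contained no (char) 0,
    // so the first zero char encountered marks the start of the padding.
    static String unpack(long[] words) {
        StringBuilder sb = new StringBuilder(words.length * 4);
        for (long word : words) {
            for (int shift = 48; shift >= 0; shift -= 16) {
                char c = (char) ((word >>> shift) & 0xFFFFL);
                if (c == 0) {
                    return sb.toString();
                }
                sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

With this sketch, WordPacker.pack("hello") yields two longs, and WordPacker.unpack applied to that array gives back "hello". The 8-characters-per-long variant would follow the same pattern with 8-bit shifts and a 0xFF mask, assuming every character fits in one byte.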
1. Split the string into 4-character "words" and pack each word into a long using masking/bit shifting. If you don't need the full Unicode set and only use 8-bit characters (such as ASCII), you can fit 8 characters into each long. Append (char) 0 to the end of the string until its length divides evenly by 4 (or 8).
2. Extend a hash set implementation (such as Trove's TLongHashSet) and add each "word" to that set, compiling an array of the internal indexes of where each long ends up in the set (make sure you also update your indexes when the set gets rehashed).
3. Use a two-dimensional int array to store these indexes (the first dimension is each compressed string, and the second dimension is each "word" index in the hash set), and return the single int index into that array to the caller (you have to own the word arrays so you can globally update the indexes on a rehash, as mentioned above).

Advantages:
- A string of length n is represented as an int array of length n/4, plus the overhead of the shared long word set, whose growth levels off as fewer new unique "words" are encountered.
- The caller is handed back a single int string "ID", which is convenient and small to store in their objects.

Disadvantages: