Is there a memory-efficient replacement of java.lang.String?

被撕碎了的回忆 2020-11-30 19:29

After reading this old article measuring the memory consumption of several object types, I was amazed to see how much memory Strings use in Java:



        
15 answers
  •  日久生厌
    2020-11-30 20:28

    I'm currently implementing a compression method as follows (I'm working on an app that needs to store a very large number of documents in memory so we can do document-to-document computation):

    • Split the string into 4-character "words" (if you need all of Unicode; a Java char is 16 bits, so four of them fit in a 64-bit long) and pack those chars into a long using masking and bit shifting. If you only need single-byte characters (7-bit ASCII, or an 8-bit set with up to 256 values), you can fit 8 characters into each long. Append (char) 0 to the end of the string until its length is divisible by 4 (or 8).
    • Extend a hash set implementation (such as Trove's TLongHashSet) and add each "word" to that set, recording in an array the internal index at which each long ends up in the set. Make sure you also update these indexes when the set is rehashed.
    • Use a two-dimensional int array to store these indexes (the first dimension is the compressed string, the second is each "word" index in the hash set), and return the single int index into that array to the caller. You have to own the word arrays so that you can globally update the indexes on a rehash, as mentioned above.
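
    The steps above can be sketched roughly as follows. This is only an illustration of the logic, not my actual code: the class PackedStringStore and its methods are hypothetical names, and a plain HashMap plus ArrayList stands in for the customised TLongHashSet. The boxed collections are not memory-efficient themselves, but their indexes stay stable, so the sketch sidesteps the rehash bookkeeping.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of the word-packing scheme; all names are illustrative. */
final class PackedStringStore {
    private static final int CHARS_PER_WORD = 4; // four 16-bit chars fit in one 64-bit long

    // word -> dictionary index (assumption: stands in for the customised long hash set)
    private final Map<Long, Integer> wordIndex = new HashMap<>();
    // dictionary index -> word
    private final List<Long> words = new ArrayList<>();
    // string ID -> dictionary indexes of its words
    private final List<int[]> strings = new ArrayList<>();

    /** Packs up to four chars starting at offset into one long, padding with (char) 0. */
    static long pack(String s, int offset) {
        long word = 0L;
        for (int i = 0; i < CHARS_PER_WORD; i++) {
            int pos = offset + i;
            char c = pos < s.length() ? s.charAt(pos) : (char) 0;
            word = (word << 16) | c; // char is unsigned, so no sign extension occurs
        }
        return word;
    }

    /** Stores a string and returns the int ID handed back to the caller. */
    int add(String s) {
        int wordCount = (s.length() + CHARS_PER_WORD - 1) / CHARS_PER_WORD; // round up
        int[] indexes = new int[wordCount];
        for (int w = 0; w < wordCount; w++) {
            long word = pack(s, w * CHARS_PER_WORD);
            Integer idx = wordIndex.get(word);
            if (idx == null) { // first time this word is seen: append it to the dictionary
                idx = words.size();
                words.add(word);
                wordIndex.put(word, idx);
            }
            indexes[w] = idx;
        }
        strings.add(indexes);
        return strings.size() - 1;
    }

    /** Rebuilds the string. Limitation: genuine (char) 0 characters are dropped with the padding. */
    String get(int id) {
        int[] indexes = strings.get(id);
        StringBuilder sb = new StringBuilder(indexes.length * CHARS_PER_WORD);
        for (int idx : indexes) {
            long word = words.get(idx);
            for (int shift = 48; shift >= 0; shift -= 16) {
                char c = (char) ((word >>> shift) & 0xFFFF);
                if (c != 0) {
                    sb.append(c);
                }
            }
        }
        return sb.toString();
    }

    /** Number of distinct words in the shared dictionary. */
    int uniqueWords() {
        return words.size();
    }
}
```

    Calling add("hello world") and then add("hello there") stores the shared word "hell" only once; get(id) returns the original text for each ID.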

    Advantages:

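    • Constant time per "word" for compression/decompression (one hash set operation or array lookup each), so the cost is linear in the string length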
    • A string of length n is represented as an int array of length n/4 (rounded up), plus the overhead of the shared long word set, whose growth levels off as fewer new unique "words" are encountered
    • The user is handed back a single int string "ID" which is convenient and small to store in their objects
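
    To put rough numbers on the second point, here is a small sketch comparing raw payload sizes only. It assumes 2 bytes per char and 4 bytes per int, and it ignores object headers and the shared word set (8 bytes per unique long plus the set's own overhead). PayloadEstimate is a hypothetical helper, not part of any library.

```java
/** Hypothetical helper: compares raw payload sizes, ignoring headers and the word set. */
final class PayloadEstimate {
    /** Bytes of UTF-16 payload: 2 bytes per char. */
    static long charPayloadBytes(int length) {
        return 2L * length;
    }

    /** Bytes of index payload: one 4-byte int per 4-char word, rounded up. */
    static long indexPayloadBytes(int length) {
        return 4L * ((length + 3) / 4);
    }
}
```

    For a 1000-character string this gives 2000 bytes of char payload against 1000 bytes of int indexes, so the saving only materialises once many strings share words in the set.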

    Disadvantages:

    • Somewhat hacky since it involves bit shifting, messing with the internals of the hash set, etc. (Bill K would not approve)
    • Only works well when you don't expect a lot of duplicate strings, because checking whether a string already exists in the library is very expensive.
