Is there a memory-efficient replacement of java.lang.String?

被撕碎了的回忆 2020-11-30 19:29

After reading this old article measuring the memory consumption of several object types, I was amazed to see how much memory Strings use in Java:



        
15 Answers
  • 2020-11-30 20:19

    Just compress them all with gzip. :) Just kidding... but I have seen stranger things, and it would give you much smaller data at significant CPU expense.

    The only other String implementations that I'm aware of are the ones in the Javolution classes. I don't think that they are more memory efficient, though:

    http://www.javolution.com/api/javolution/text/Text.html
    http://www.javolution.com/api/javolution/text/TextBuilder.html
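
    A minimal sketch of the gzip idea above, assuming you're willing to hold each string as a compressed byte[] and pay the CPU cost on every access (the class and method names here are illustrative, not from Javolution or any other library mentioned in this thread):

        import java.io.ByteArrayInputStream;
        import java.io.ByteArrayOutputStream;
        import java.io.IOException;
        import java.nio.charset.StandardCharsets;
        import java.util.zip.GZIPInputStream;
        import java.util.zip.GZIPOutputStream;

        // Keeps a string as a gzip-compressed byte[]; decompresses on access.
        // Trades CPU time for memory, as described above. Java 11+ assumed.
        public class GzippedString {

            static byte[] compress(String s) throws IOException {
                ByteArrayOutputStream buffer = new ByteArrayOutputStream();
                try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
                    gzip.write(s.getBytes(StandardCharsets.UTF_8));
                }
                return buffer.toByteArray();
            }

            static String decompress(byte[] compressed) throws IOException {
                try (GZIPInputStream gzip =
                         new GZIPInputStream(new ByteArrayInputStream(compressed))) {
                    return new String(gzip.readAllBytes(), StandardCharsets.UTF_8);
                }
            }

            public static void main(String[] args) throws IOException {
                String original = "a long, repetitive string ".repeat(100); // 2600 chars
                byte[] packed = compress(original);
                System.out.println(packed.length);                        // far fewer than 2600 bytes
                System.out.println(decompress(packed).equals(original));  // true
            }
        }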

  • 2020-11-30 20:24

    Java chose UTF-16 as a compromise between speed and storage size. Processing UTF-8 data is much more of a pain than processing UTF-16 data (e.g. when trying to find the position of character X in the byte array, how are you going to do that quickly if every character can take one, two, three or four bytes? Ever thought about that? Scanning the string byte by byte is not exactly fast). Of course UTF-32 would be the easiest to process, but it wastes twice the storage space. Things have changed since the early Unicode days: now certain characters need 4 bytes even in UTF-16, and handling those correctly makes UTF-16 almost as awkward as UTF-8.

    Anyway, rest assured that if you implement a String class with internal UTF-8 storage, you may win some memory, but you will lose processing speed for many string methods. Your argument is also too limited a point of view: it will not hold true for someone in Japan, since Japanese characters are not smaller in UTF-8 than in UTF-16 (they take 3 bytes in UTF-8, but only 2 bytes in UTF-16). I don't understand why programmers in such a globally connected world, with the Internet everywhere, still talk about "western languages" as if they were all that counted, as if only the western world had computers and the rest lived in caves. Sooner or later any application gets bitten by the fact that it fails to process non-western characters effectively.
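
    If you want to verify the size claims above, here is a quick, illustrative comparison (not part of the original answer) of encoded lengths for Latin and Japanese text:

        import java.nio.charset.StandardCharsets;

        // Prints the encoded size of ASCII text vs. Japanese text in UTF-8 and UTF-16.
        public class EncodingSizes {
            public static void main(String[] args) {
                String ascii = "memory";    // 6 Latin characters
                String japanese = "メモリ";  // 3 katakana characters
                System.out.println(ascii.getBytes(StandardCharsets.UTF_8).length);       // 6
                System.out.println(ascii.getBytes(StandardCharsets.UTF_16LE).length);    // 12
                System.out.println(japanese.getBytes(StandardCharsets.UTF_8).length);    // 9
                System.out.println(japanese.getBytes(StandardCharsets.UTF_16LE).length); // 6
            }
        }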

  • 2020-11-30 20:28

    I'm currently implementing a compression method as follows (I'm working on an app that needs to store a very large number of documents in memory so we can do document-to-document computation):

    • Split the string into 4-character "words" (if you need the full 16-bit Unicode range) and pack those chars into a long using masking/bit shifting; see the packing sketch at the end of this answer. If you don't need the full Unicode set and only 256 distinct single-byte characters, you can fit 8 characters into each long. Add (char) 0 to the end of the string until the length divides evenly by 4 (or 8).
    • Override a hash set implementation (like Trove's TLongHashSet) and add each "word" to that set, compiling an array of the internal indexes of where the long ends up in the set (make sure you also update your index when the set gets rehashed)
    • Use a two-dimensional int array to store these indexes (so the first dimension is each compressed string, and the second dimension is each "word" index in the hash set), and return the single int index into that array back to the caller (you have to own the word arrays so you can globally update the index on a rehash as mentioned above)

    Advantages:

    • Constant time compression/decompression
    • A length n string is represented as an int array of length n/4, with the additional overhead of the long word set which grows asymptotically as fewer unique "words" are encountered
    • The user is handed back a single int string "ID" which is convenient and small to store in their objects

    Disadvantages:

    • Somewhat hacky since it involves bit shifting, messing with the internals of the hash set, etc. (Bill K would not approve)
    • Works well when you don't expect a lot of duplicate strings. It's very expensive to check to see if a string already exists in the library.
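
    Here is a minimal sketch of just the packing/unpacking step from the first bullet, assuming 16-bit chars and four chars per long; it deliberately leaves out the Trove set and the index bookkeeping, and all names are illustrative:

        // Packs four 16-bit chars into one 64-bit long and unpacks them again.
        // Strings whose length is not a multiple of 4 are padded with (char) 0.
        public final class CharPacker {

            static long pack(String s, int offset) {
                long word = 0L;
                for (int i = 0; i < 4; i++) {
                    char c = offset + i < s.length() ? s.charAt(offset + i) : (char) 0;
                    word |= ((long) c & 0xFFFFL) << (i * 16);
                }
                return word;
            }

            static void unpack(long word, StringBuilder out) {
                for (int i = 0; i < 4; i++) {
                    char c = (char) ((word >>> (i * 16)) & 0xFFFF);
                    if (c != 0) {          // skip the padding
                        out.append(c);
                    }
                }
            }

            public static void main(String[] args) {
                String input = "memory-efficient";                 // 16 chars -> 4 longs
                long[] words = new long[(input.length() + 3) / 4];
                for (int w = 0; w < words.length; w++) {
                    words[w] = pack(input, w * 4);
                }
                StringBuilder rebuilt = new StringBuilder();
                for (long word : words) {
                    unpack(word, rebuilt);
                }
                System.out.println(rebuilt);                       // memory-efficient
            }
        }
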
  • 2020-11-30 20:28

    You said not to repeat the article's suggestion of rolling your own interning scheme, but what's wrong with String.intern itself? The article contains the following throwaway remark:

    Numerous reasons exist to avoid the String.intern() method. One is that few modern JVMs can intern large amounts of data.

    But even if the memory usage figures from 2002 still hold six years later, I'd be surprised if no progress has been made on how much data JVMs can intern.

    This isn't purely a rhetorical question - I'm interested to know if there are good reasons to avoid it. Is it implemented inefficiently for highly-multithreaded use? Does it fill up some special JVM-specific area of the heap? Do you really have hundreds of megabytes of unique strings (so interning would be useless anyway)?
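
    For reference, the behaviour in question is simply that String.intern() collapses equal strings into one canonical instance, so duplicates stop holding separate copies of the character data:

        // Equal strings collapse to one canonical instance after intern(),
        // so duplicates no longer retain separate backing character data.
        public class InternDemo {
            public static void main(String[] args) {
                String a = new String("the same text");
                String b = new String("the same text");
                System.out.println(a == b);                   // false: two heap copies
                System.out.println(a.intern() == b.intern()); // true: one canonical copy
            }
        }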

  • 2020-11-30 20:29

    The article points out two things:

    1. Character arrays increase in chunks of 8 bytes.
    2. There is a large difference in size between char[] and String objects.

    The overhead is due to including a char[] object reference, and three ints: an offset, a length, and space for storing the String's hashcode, plus the standard overhead of simply being an object.

    A slightly different approach from String.intern(), or from the character array sharing used by String.substring(), is to use a single char[] for all Strings. That means you do not need to store the object reference in your wrapper String-like object, but you still need the offset, and you introduce a (large) limit on how many characters you can have in total.

    You would no longer need the length if you use a special end of string marker. That saves four bytes for the length, but costs you two bytes for the marker, plus the additional time, complexity, and buffer overrun risks.

    The space-time trade-off of not storing the hash may help you if you do not need it often.

    For an application that I've worked with, where I needed super fast and memory efficient treatment of a large number of strings, I was able to leave the data in its encoded form, and work with byte arrays. My output encoding was the same as my input encoding, and I didn't need to decode bytes to characters nor encode back to bytes again for output.

    In addition, I could leave the input data in the byte array it was originally read into - a memory mapped file.

    My objects consisted of an int offset (the limit suited my situation), an int length, and an int hashcode.

    java.lang.String was the familiar hammer for what I wanted to do, but not the best tool for the job.
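
    A rough sketch of the kind of object described above - an int offset and int length into one shared byte[] (for example the buffer a memory-mapped file was read into), plus a cached int hashcode. The names are illustrative, and the UTF-8 decode in toString() is an assumption; the approach described above never decoded at all:

        import java.nio.charset.StandardCharsets;

        // A lightweight String-like handle: an offset and length into one shared
        // byte[] plus a cached hashcode. The backing array is never copied.
        public final class ByteSlice {
            private final byte[] data;   // shared backing array, e.g. a mapped file buffer
            private final int offset;
            private final int length;
            private final int hash;

            public ByteSlice(byte[] data, int offset, int length) {
                this.data = data;
                this.offset = offset;
                this.length = length;
                this.hash = computeHash(data, offset, length);
            }

            private static int computeHash(byte[] data, int offset, int length) {
                int h = 1;
                for (int i = offset; i < offset + length; i++) {
                    h = 31 * h + data[i];
                }
                return h;
            }

            @Override public int hashCode() {
                return hash;
            }

            @Override public boolean equals(Object other) {
                if (!(other instanceof ByteSlice)) {
                    return false;
                }
                ByteSlice o = (ByteSlice) other;
                if (length != o.length || hash != o.hash) {
                    return false;
                }
                for (int i = 0; i < length; i++) {
                    if (data[offset + i] != o.data[o.offset + i]) {
                        return false;
                    }
                }
                return true;
            }

            @Override public String toString() {
                // Decode only when a real String is actually required.
                return new String(data, offset, length, StandardCharsets.UTF_8);
            }
        }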

  • 2020-11-30 20:31

    The UseCompressedStrings JVM option seems like the easiest route to take. If you're using strings only for storage, and not doing any equals/substring/split operations, then something like this CompactCharSequence class could work:

    http://www.javamex.com/tutorials/memory/ascii_charsequence.shtml
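
    For illustration only - this is not the code from that article - a bare-bones CharSequence backed by one byte per character, which works as long as the text fits in ISO-8859-1/ASCII:

        import java.nio.charset.StandardCharsets;

        // Stores text as one byte per character and exposes it as a CharSequence.
        // Only valid for text that fits in single bytes (ASCII / ISO-8859-1).
        public final class AsciiCharSequence implements CharSequence {
            private final byte[] bytes;

            public AsciiCharSequence(String s) {
                this.bytes = s.getBytes(StandardCharsets.ISO_8859_1);
            }

            private AsciiCharSequence(byte[] bytes) {
                this.bytes = bytes;
            }

            @Override public int length() {
                return bytes.length;
            }

            @Override public char charAt(int index) {
                return (char) (bytes[index] & 0xFF);
            }

            @Override public CharSequence subSequence(int start, int end) {
                byte[] sub = new byte[end - start];
                System.arraycopy(bytes, start, sub, 0, sub.length);
                return new AsciiCharSequence(sub);
            }

            @Override public String toString() {
                return new String(bytes, StandardCharsets.ISO_8859_1);
            }
        }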
