Is there a memory-efficient replacement of java.lang.String?

后端 未结 15 673
被撕碎了的回忆
被撕碎了的回忆 2020-11-30 19:29

After reading this old article measuring the memory consumption of several object types, I was amazed to see how much memory Strings use in Java:



        
15条回答
  •  情书的邮戳
    2020-11-30 20:18

    An internal UTF-8 encoding has its advantages (such as the smaller memory footprint that you pointed out), but it has disadvantages too.

    For example, determining the character-length (rather than the byte-length) of a UTF-8 encoded string is an O(n) operation. In a java string, the cost of determining the character-length is O(1), while generating the UTF-8 representation is O(n).

    It's all about priorities.

    Data-structure design can often be seen as a tradeoff between speed and space. In this case, I think the designers of the Java string API made a choice based on these criteria:

    • The String class must support all possible unicode characters.

    • Although unicode defines 1 byte, 2 byte, and 4-byte variants, the 4-byte characters are (in practice) pretty rare, so it's okay to represent them as surrogate pairs. That's why java uses a 2-byte char primitive.

    • When people call length(), indexOf(), and charAt() methods, they're interested in the character position, not the byte position. In order to create fast implementations of these methods, it's necessary to avoid the internal UTF-8 encoding.

    • Languages like C++ make the programmer's life more complicated by defining three different character types and forcing the programmer to choose between them. Most programmers start off using simple ASCII strings, but when they eventually need to support international characters, the process of modifying the code to use multibyte characters is extremely painful. I think the Java designers made an excellent compromise choice by saying that all strings consist of 2-byte characters.

提交回复
热议问题