Is there a memory-efficient replacement of java.lang.String?

Backend · open · 15 answers · 620 views
Asked by 被撕碎了的回忆 on 2020-11-30 19:29

After reading this old article measuring the memory consumption of several object types, I was amazed to see how much memory Strings use in Java:



        
15 answers
  • 2020-11-30 20:11

    With a Little Bit of Help From the JVM...

    WARNING: This solution is now obsolete in newer Java SE versions. See other ad-hoc solutions further below.

    If you use a HotSpot JVM, since Java 6 update 21, you can use this command-line option:

    -XX:+UseCompressedStrings
    

    The JVM Options page reads:

    Use a byte[] for Strings which can be represented as pure ASCII. (Introduced in Java 6 Update 21 Performance Release)

    UPDATE: This feature was broken in a later version and was supposed to be fixed again in Java SE 6u25, as mentioned in the 6u25 b03 release notes (however, we don't see it in the 6u25 final release notes). The bug report 7016213 is not visible for security reasons. So, use with care and check first. Like any -XX option, it is deemed experimental and subject to change without much notice, so it's probably best not to use it in the startup script of a production server.

    UPDATE 2013-03 (thanks to a comment by Aleksey Maximus): See this related question and its accepted answer. The option now seems to be deceased. This is further confirmed in the bug 7129417 report.

    The End Justifies the Means

    Warning: (Ugly) Solutions for Specific Needs

    This is a bit out of the box and lower-level, but since you asked... don't shoot the messenger!

    Your Own Lighter String Representation

    If ASCII is fine for your needs, then why don't you just roll your own implementation?

    As you mentioned, you could use byte[] instead of char[] internally. But that's not all.

    To make it even more lightweight, instead of wrapping your byte arrays in a class, why not simply use a helper class containing mostly static methods operating on these byte arrays that you pass around? Sure, it's going to feel pretty C-ish, but it would work, and would save you the huge overhead that goes with String objects.

    And sure, it would miss some nice functionalities... unless you re-implement them. If you really need them, then there's not much choice. Thanks to OpenJDK and a lot of other good projects, you could very well roll out your own fugly LiteStrings class that just operates on byte[] parameters. You'll feel like taking a shower every time you need to call a function, but you'll have saved heaps of memory.
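    A minimal sketch of what such a C-ish helper could look like (the class name LiteStrings and its methods are hypothetical, purely for illustration), assuming pure-ASCII content:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: static methods operating directly on ASCII byte[]
// arrays, avoiding the per-object overhead of java.lang.String.
public final class LiteStrings {
    private LiteStrings() {}

    // Convert a String (assumed pure ASCII) to a compact byte[].
    public static byte[] fromString(String s) {
        byte[] b = new byte[s.length()];
        for (int i = 0; i < b.length; i++) {
            char c = s.charAt(i);
            if (c > 127) throw new IllegalArgumentException("non-ASCII char: " + c);
            b[i] = (byte) c;
        }
        return b;
    }

    // Adapter back to String, for interop at the boundaries.
    public static String toString(byte[] b) {
        return new String(b, StandardCharsets.US_ASCII);
    }

    public static int length(byte[] b) {
        return b.length;
    }

    // Linear scan, mirroring String.indexOf for a single character.
    public static int indexOf(byte[] b, byte target) {
        for (int i = 0; i < b.length; i++) {
            if (b[i] == target) return i;
        }
        return -1;
    }
}
```

    Callers would pass the raw byte[] around and go through the adapters only when a real String is unavoidable.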

    I'd recommend making it closely resemble the String class's contract and providing meaningful adapters and builders to convert from and to String; you might also want adapters to and from StringBuffer and StringBuilder, as well as mirror implementations of other things you might need. Definitely some work, but it might be worth it (see the "Make it Count!" section a bit below).

    On-the-Fly Compression/Decompression

    You could very well compress your strings in memory and decompress them on the fly when you need them. After all, you only need to be able to read them when you access them, right?

    Of course, being that violent will mean:

    • more complex (thus less maintainable) code,
    • more processing power,
    • relatively long strings are needed for the compression to be relevant (or to compact multiple strings into one by implementing your own store system, to make the compression more effective).
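    As a rough illustration of the idea (a sketch under simplifying assumptions, not production code), one could wrap java.util.zip's Deflater/Inflater like this:

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Illustrative wrapper: store the string compressed, decompress on access.
public final class CompressedString {
    private final byte[] compressed;

    public CompressedString(String s) {
        byte[] raw = s.getBytes(StandardCharsets.UTF_8);
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream(raw.length);
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        this.compressed = out.toByteArray();
    }

    // Decompress on the fly, only when the value is actually read.
    public String decompress() {
        try {
            Inflater inflater = new Inflater();
            inflater.setInput(compressed);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            while (!inflater.finished()) {
                out.write(buf, 0, inflater.inflate(buf));
            }
            inflater.end();
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt compressed data", e);
        }
    }
}
```

    Note that for short strings the Deflater header overhead can make the "compressed" form larger than the original, which is exactly the third caveat above.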

    Do Both

    For a full headache, of course you can do all of that:

    • C-ish helper class,
    • byte arrays,
    • on-the-fly compressed store.

    Be sure to make that open-source. :)

    Make it Count!

    By the way, see this great presentation on Building Memory-Efficient Java Applications by N. Mitchell and G. Sevitsky: [2008 version], [2009 version].

    From this presentation, we see that an 8-char string eats 64 bytes on a 32-bit system (96 for a 64-bit system!!), and most of it is due to JVM overhead. And from this article we see that an 8-byte array would eat "only" 24 bytes (12 bytes of header + 8 x 1 byte + 4 bytes of alignment).

    Sounds like this could be worth it if you really manipulate a lot of that stuff (and possibly speed up things a bit, as you'd spend less time allocating memory, but don't quote me on that and benchmark it; plus it would depend greatly on your implementation).

  • 2020-11-30 20:11

    I believe that Strings have been less memory-intensive for some time now, because the Java engineers have implemented the flyweight design pattern to share as much as possible. In fact, String literals that have the same value point to the very same interned object in memory (runtime-constructed Strings are only shared if you call intern() on them).
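    A quick way to see this sharing in action (literals are interned in the string pool; strings built at runtime are not, unless you intern() them):

```java
public class InternDemo {
    public static void main(String[] args) {
        String a = "hello";                   // literal → placed in the string pool
        String b = "hello";                   // same literal → same pooled object
        System.out.println(a == b);           // true: identical reference

        String c = new String("hello");       // explicit allocation → new heap object
        System.out.println(a == c);           // false: equal value, different object
        System.out.println(a == c.intern());  // true: intern() returns the pooled instance
    }
}
```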

  • 2020-11-30 20:12

    At Terracotta, we have some cases where we compress big Strings as they are sent around the network and actually leave them compressed until decompression is necessary. We do this by converting the char[] to byte[], compressing the byte[], then encoding that byte[] back into the original char[]. For certain operations like hash and length, we can answer those questions without decoding the compressed string. For data like big XML strings, you can get substantial compression this way.

    Moving the compressed data around the network is a definite win. Keeping it compressed is dependent on the use case. Of course, we have some knobs to turn this off and change the length at which compression turns on, etc.
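    A toy sketch of that idea (not Terracotta's actual code, which works via bytecode instrumentation): keep the payload compressed but precompute the answers you still need, such as length and hash, so those queries never touch the compressed bytes:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Illustrative wrapper: compressed payload plus precomputed metadata.
public final class PackedString {
    private final byte[] gzipped;
    private final int length; // character length, answered without decompressing
    private final int hash;   // same value String.hashCode() would return

    public PackedString(String s) {
        this.length = s.length();
        this.hash = s.hashCode();
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(s.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e); // in-memory streams never actually fail
        }
        this.gzipped = bos.toByteArray();
    }

    public int length() { return length; }

    @Override
    public int hashCode() { return hash; }

    // Decompression happens only when the actual characters are needed.
    public String unpack() {
        try (GZIPInputStream gz =
                 new GZIPInputStream(new ByteArrayInputStream(gzipped))) {
            return new String(gz.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```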

    This is all done with byte code instrumentation on java.lang.String which we've found is very delicate due to how early String is used in startup but is stable if you follow some guidelines.

  • 2020-11-30 20:13

    Today (2010), each GB you add to a server costs about £80 or $120. Before you go re-engineering the String, you should ask yourself whether it is really worth it.

    If you are going to save a GB of memory, perhaps. Ten GB, definitely. If you want to save tens of MB, you are likely to spend more time than it's worth.

    How you compact the Strings really depends on your usage pattern. Are there lots of repeated strings? (use an object pool) Are there lots of long strings? (use compression/encoding)
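    For the repeated-strings case, a pool can be as small as this sketch (String.intern() is the built-in alternative, with its own trade-offs):

```java
import java.util.concurrent.ConcurrentHashMap;

// Minimal string pool: equal values share a single canonical instance.
public final class StringPool {
    private final ConcurrentHashMap<String, String> pool = new ConcurrentHashMap<>();

    // Returns the canonical instance for this value, registering it on first use.
    public String canonicalize(String s) {
        String existing = pool.putIfAbsent(s, s);
        return existing != null ? existing : s;
    }
}
```

    Every duplicate passed through canonicalize() collapses to one object, so memory scales with distinct values rather than total occurrences.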

    Another reason you might want smaller strings is to reduce cache usage. Even the largest CPUs have about 8 MB - 12 MB of cache. This can be a more precious resource and not easily increased. In this case I suggest you look at alternatives to strings, but you must have in mind how much difference it will make in £ or $ against the time it takes.

  • 2020-11-30 20:13

    Remember that there are many types of compression. Using huffman encoding is a good general purpose approach - but it is relatively CPU intensive. For a B+Tree implementation I worked on a few years back, we knew that the keys would likely have common leading characters, so we implemented a leading character compression algorithm for each page in the B+Tree. The code was easy, very, very fast, and resulted in a memory usage 1/3 of what we started with. In our case, the real reason for doing this was to save space on disk, and reduce time spent on disk -> RAM transfers (and that 1/3 savings made a huge difference in effective disk performance).
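    The leading-character idea can be sketched like this (a simplified illustration, not the actual B+Tree code; each sorted key stores only the suffix that differs from its predecessor):

```java
import java.util.ArrayList;
import java.util.List;

// Front coding: exploit shared prefixes between consecutive sorted keys.
public final class FrontCoding {
    // Encode each key as "sharedPrefixLength|suffix" relative to the previous key.
    public static List<String> encode(List<String> sortedKeys) {
        List<String> out = new ArrayList<>();
        String prev = "";
        for (String key : sortedKeys) {
            int shared = 0;
            int max = Math.min(prev.length(), key.length());
            while (shared < max && prev.charAt(shared) == key.charAt(shared)) {
                shared++;
            }
            out.add(shared + "|" + key.substring(shared));
            prev = key;
        }
        return out;
    }

    // Reverse the encoding by rebuilding each key from its predecessor's prefix.
    public static List<String> decode(List<String> encoded) {
        List<String> out = new ArrayList<>();
        String prev = "";
        for (String e : encoded) {
            int sep = e.indexOf('|');
            int shared = Integer.parseInt(e.substring(0, sep));
            String key = prev.substring(0, shared) + e.substring(sep + 1);
            out.add(key);
            prev = key;
        }
        return out;
    }
}
```

    Encoding and decoding are a single linear pass, which is why this kind of scheme can be "very, very fast" compared with general-purpose compressors.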

    The reason that I bring this up is that a custom String implementation wouldn't have helped very much here. We were only able to achieve the gains we did because we worked at the layer of the container that the strings live in.

    Trying to optimize a few bytes here and there inside the String object may not be worth it in comparison.

  • 2020-11-30 20:18

    An internal UTF-8 encoding has its advantages (such as the smaller memory footprint that you pointed out), but it has disadvantages too.

    For example, determining the character length (rather than the byte length) of a UTF-8 encoded string is an O(n) operation. In a Java string, the cost of determining the character length is O(1), while generating the UTF-8 representation is O(n).
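    To illustrate the O(n) scan: counting the characters in UTF-8 bytes means inspecting every byte (this counts code points, which matches String.length() only for strings without surrogate pairs):

```java
// Counting characters in UTF-8 requires an O(n) pass over every byte,
// whereas String.length() just reads a stored field in O(1).
public final class Utf8Length {
    public static int codePointCount(byte[] utf8) {
        int count = 0;
        for (byte b : utf8) {
            // UTF-8 continuation bytes look like 10xxxxxx; count only lead bytes.
            if ((b & 0xC0) != 0x80) count++;
        }
        return count;
    }
}
```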

    It's all about priorities.

    Data-structure design can often be seen as a tradeoff between speed and space. In this case, I think the designers of the Java string API made a choice based on these criteria:

    • The String class must support all possible unicode characters.

    • Although Unicode encodings come in 1-byte, 2-byte, and 4-byte variants, the characters that need 4 bytes are (in practice) pretty rare, so it's okay to represent them as surrogate pairs. That's why Java uses a 2-byte char primitive.

    • When people call length(), indexOf(), and charAt() methods, they're interested in the character position, not the byte position. In order to create fast implementations of these methods, it's necessary to avoid the internal UTF-8 encoding.

    • Languages like C++ make the programmer's life more complicated by defining three different character types and forcing the programmer to choose between them. Most programmers start off using simple ASCII strings, but when they eventually need to support international characters, the process of modifying the code to use multibyte characters is extremely painful. I think the Java designers made an excellent compromise choice by saying that all strings consist of 2-byte characters.
