Why does a dictionary use so much RAM in Python


My guess is that you have multiple copies of your dictionary simultaneously stored in memory (in various formats). As an example, the line:

tweets = map(json.loads, tweet_file) #creates a list of dictionaries (one per tweet)

will create a new copy (+400~1000 MB, including the dictionary overhead), but your original tweet_file stays in memory. Why such big numbers? Well, if you work with Unicode strings, each Unicode character uses 2 or 4 bytes in memory, whereas in your file, assuming UTF-8 encoding, most characters use only 1 byte. If you were working with plain byte strings in Python 2, the size of a string in memory should be almost the same as its size on disk, so you would have to find another explanation.
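As an aside, if you do not need all the tweets in memory at once, you can avoid building that extra copy entirely. A minimal sketch, assuming each line of the file is a single JSON document ('tweets.json' is just a placeholder name):

import itertools
import json

with open('tweets.json') as tweet_file:  # placeholder file name
    # itertools.imap is the lazy equivalent of map in Python 2:
    # only one decoded tweet is alive at a time, no list is built
    for tweet in itertools.imap(json.loads, tweet_file):
        pass  # process the tweet here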

EDIT: The actual number of bytes occupied by a "character" in Python 2 may vary. Here are some examples:

>>> import sys
>>> sys.getsizeof("")
40
>>> sys.getsizeof("a")
41
>>> sys.getsizeof("ab")
42

As you can see, it appears that each character is encoded as one byte. But:

>>> sys.getsizeof("à")
42

Not for "French" characters. And ...

>>> sys.getsizeof("世")
43
>>> sys.getsizeof("世界")
46

For Japanese, we have 3 bytes per character.

The above results are system dependent -- they are explained by the fact that my system uses UTF-8 as its default encoding. The "size of the string" calculated just above is in fact the size of the byte string representing the given text.
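You can check this on the same build as above by subtracting the fixed str header (40 bytes here); what is left is exactly the number of UTF-8 bytes:

>>> sys.getsizeof("世界") - sys.getsizeof("")  # payload beyond the fixed header
6
>>> len("世界")  # number of UTF-8 bytes (3 per character)
6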

As json.loads produces "unicode" strings, the results are somewhat different:

>>> sys.getsizeof(u"")
52
>>> sys.getsizeof(u"a")
56
>>> sys.getsizeof(u"ab")
60
>>> sys.getsizeof(u"世")
56
>>> sys.getsizeof(u"世界")
60

In that case, as you can see, each extra character adds 4 extra bytes.
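To relate this back to your tweets, here is a rough sketch comparing the size of one line on disk with the size of its decoded text in memory (the sample line is made up; in Python 2, json.loads returns unicode objects):

import sys
import json

line = '{"text": "caf\\u00e9 \\u4e16\\u754c"}'  # made-up line as it would sit in the file
tweet = json.loads(line)                        # the text comes back as a unicode object

print len(line)                      # bytes this line occupies on disk
print sys.getsizeof(tweet[u'text'])  # RAM used by the decoded text alone

For mostly-ASCII tweets, the decoded unicode text can be several times larger than the corresponding bytes in the file.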


Maybe the file object caches some data? If you want to trigger explicit deallocation of an object, try setting its reference to None:

tweets = map(json.loads, tweet_file) #creates a list of dictionaries (one per tweet)
[...]
tweet_file.close()
tweet_file = None

When there is no longer any reference to an object, Python will deallocate it -- and so free the corresponding memory (within the Python heap; the memory is not necessarily returned to the operating system).
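A small sketch of that behaviour -- CPython frees most objects as soon as their reference count drops to zero, and gc.collect() is only needed for objects caught in reference cycles:

import sys
import gc

data = ["x"] * 1000000
print sys.getrefcount(data)  # 2: the name 'data' plus the temporary argument reference
data = None                  # drop the last reference; the list is freed immediately
gc.collect()                 # only matters for objects involved in reference cycles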

I wrote a quick test script to confirm your results...

import sys
import os
import json
import resource

def get_rss():
    # ru_maxrss is reported in kilobytes on Linux, so convert to bytes
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss * 1024

def getsizeof_r(obj):
    # Recursively sum sys.getsizeof over lists and dicts, since sys.getsizeof
    # alone only counts the container itself, not its contents
    total = 0
    if isinstance(obj, list):
        for i in obj:
            total += getsizeof_r(i)
    elif isinstance(obj, dict):
        for k, v in obj.iteritems():
            total += getsizeof_r(k) + getsizeof_r(v)
    else:
        total += sys.getsizeof(obj)
    return total

def main():
    start_rss = get_rss()
    filename = 'foo'
    f = open(filename, 'r')
    l = map(json.loads, f)
    f.close()
    end_rss = get_rss()

    print 'File size is: %d' % os.path.getsize(filename)
    print 'Data size is: %d' % getsizeof_r(l)
    print 'RSS delta is: %d' % (end_rss - start_rss)

if __name__ == '__main__':
    main()

...which prints...

File size is: 1060864
Data size is: 4313088
RSS delta is: 4722688

...so I'm only getting a four-fold increase, which would be accounted for by the fact that each Unicode char takes up four bytes of RAM.

Perhaps you could test your input file with this script, since I can't explain why you get an eight-fold increase with your script.

Have you considered the memory usage for the keys? If you have lots of small values in your dictionary, the storage for the keys could dominate.
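A quick way to get a feel for that is to sum the sizes of keys and values separately; a sketch on a made-up tweet dictionary:

import sys

tweet = {u'id': 123456789, u'lang': u'en', u'retweeted': False}  # made-up example
print sum(sys.getsizeof(k) for k in tweet)           # memory used by the key strings
print sum(sys.getsizeof(v) for v in tweet.values())  # memory used by the values
print sys.getsizeof(tweet)                           # overhead of the dict object itself

For tweets, where many values are short strings, numbers or booleans, the keys and the per-entry dict overhead can easily be as large as the values themselves.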
