Question
I have written a Python script that reads the contents of two files: the first is a relatively small file (~30KB) and the second is a larger file (~270MB). The contents of both files are loaded into dictionary data structures. When the second file is loaded I would have expected the amount of RAM required to be roughly equivalent to the size of the file on disk, perhaps with some overhead, but watching the RAM usage on my PC it consistently takes ~2GB (around 8 times the size of the file). The relevant source code is below (pauses inserted just so I can see the RAM usage at each stage). The line consuming large amounts of memory is "tweets = map(json.loads, tweet_file)":
import sys
import json

scores = {}

def get_scores(term_file):
    global scores
    for line in term_file:
        term, score = line.split("\t")  # tab character
        scores[term] = int(score)

def pause():
    tmp = raw_input('press any key to continue: ')

def main():
    # get terms and their scores...
    print 'open word list file ...'
    term_file = open(sys.argv[1])
    pause()

    print 'create dictionary from word list file ...'
    get_scores(term_file)
    pause()

    print 'close word list file ...'
    term_file.close()
    pause()

    # get tweets from file...
    print 'open tweets file ...'
    tweet_file = open(sys.argv[2])
    pause()

    print 'create list of dictionaries from tweets file ...'
    tweets = map(json.loads, tweet_file)  # creates a list of dictionaries (one per tweet)
    pause()

    print 'close tweets file ...'
    tweet_file.close()
    pause()
Does anyone know why this is? My concern is that I would like to extend my research to larger files, but will quickly run out of memory. Interestingly, the memory usage does not seem to increase noticeably after opening the file (as I think this just creates a file object rather than reading the contents).
I have an idea to try looping through the file one line at a time, processing what I can and storing only the minimum that I need for future reference rather than loading everything into a list of dictionaries, but I was just interested to see whether the roughly 8x multiplier from file size to memory when creating a dictionary is in line with other people's experience?
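A minimal sketch of that line-at-a-time approach, assuming each line of the tweets file holds one standalone JSON object and that only a single aggregate score per tweet needs to be kept (the 'text' field name and the scoring logic are assumptions for illustration):

import json

def score_tweets(tweet_path, scores):
    # Parse one tweet per line; each parsed dict is discarded after use,
    # so memory stays roughly constant regardless of file size.
    totals = []
    with open(tweet_path) as tweet_file:
        for line in tweet_file:
            tweet = json.loads(line)
            text = tweet.get('text', u'')                        # assumed field name
            total = sum(scores.get(word, 0) for word in text.split())
            totals.append(total)                                 # keep only the score
    return totals

Only the list of per-tweet totals is retained, so the memory footprint no longer scales with the size of the parsed dictionaries.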
Answer 1:
My guess is that you have multiple copies of your dictionary stored in memory simultaneously (in various formats). As an example, the line:
tweets = map(json.loads, tweet_file) #creates a list of dictionaries (one per tweet)
will create a new copy (+400~1000MB, including the dictionary overhead). But your original tweet_file
stays in memory. Why such big numbers? Well, if you are working with Unicode strings, each Unicode character uses 2 or 4 bytes in memory, whereas in your file, assuming UTF-8 encoding, most of the characters use only 1 byte. If you were working with plain byte strings in Python 2, the size of the strings in memory would be almost the same as the size on disk, so you would have to find another explanation.
EDIT: The actual number of bytes occupied by a "character" in Python 2 may vary. Here are some examples:
>>> import sys
>>> sys.getsizeof("")
40
>>> sys.getsizeof("a")
41
>>> sys.getsizeof("ab")
42
As you can see, it appears that each character is encoded as one byte. But:
>>> sys.getsizeof("à")
42
Not for "French" characters. And ...
>>> sys.getsizeof("世")
43
>>> sys.getsizeof("世界")
46
For Japanese, we have 3 bytes per character.
The above results are system dependent -- and are explained by the fact that my system uses 'UTF-8' as its default encoding. The "size of the string" calculated just above is in fact the size of the byte string representing the given text.
If 'json.loads' produces "unicode" strings, the results are somewhat different:
>>> sys.getsizeof(u"")
52
>>> sys.getsizeof(u"a")
56
>>> sys.getsizeof(u"ab")
60
>>> sys.getsizeof(u"世")
56
>>> sys.getsizeof(u"世界")
60
In that case, as you can see, each extra character adds 4 extra bytes.
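Using the per-character figures reported above (52 bytes for an empty unicode string plus 4 per character; actual values vary by Python build), the disk-to-memory ratio for a single short string can be checked directly:

>>> import sys
>>> s = u"hello"
>>> len(s.encode('utf-8'))   # bytes this text occupies on disk as UTF-8
5
>>> sys.getsizeof(s)         # bytes the unicode object occupies in memory: 52 + 4*5
72

For short strings the fixed object header dominates, which is why a file full of small JSON values can blow up far beyond the 2x or 4x suggested by the character encoding alone.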
Maybe the file object caches some data? If you want to trigger explicit deallocation of an object, try setting its reference to None:
tweets = map(json.loads, tweet_file) #creates a list of dictionaries (one per tweet)
[...]
tweet_file.close()
tweet_file = None
When there is no longer any reference to an object, Python will deallocate it -- and so free the corresponding memory (within the Python heap -- I don't think the memory is returned to the system).
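A rough way to observe this, assuming a Linux system where the current resident set size can be read from /proc/self/status (a platform-specific sketch, not part of the original answer; 'tweets.json' is a placeholder path):

import gc
import json

def current_rss_kb():
    # Current (not peak) resident set size, Linux only.
    with open('/proc/self/status') as status:
        for line in status:
            if line.startswith('VmRSS:'):
                return int(line.split()[1])
    return -1

tweet_file = open('tweets.json')
tweets = map(json.loads, tweet_file)
tweet_file.close()
print 'RSS with tweets loaded: %d kB' % current_rss_kb()

tweets = None       # drop the only reference to the big list
tweet_file = None
gc.collect()        # collect any reference cycles left behind
print 'RSS after dropping ref: %d kB' % current_rss_kb()

As noted above, CPython may keep the freed memory in its own allocator rather than hand it back to the OS, so the second figure may not drop all the way back down.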
Answer 2:
I wrote a quick test script to confirm your results...
import sys
import os
import json
import resource

def get_rss():
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss * 1024

def getsizeof_r(obj):
    total = 0
    if isinstance(obj, list):
        for i in obj:
            total += getsizeof_r(i)
    elif isinstance(obj, dict):
        for k, v in obj.iteritems():
            total += getsizeof_r(k) + getsizeof_r(v)
    else:
        total += sys.getsizeof(obj)
    return total

def main():
    start_rss = get_rss()
    filename = 'foo'
    f = open(filename, 'r')
    l = map(json.loads, f)
    f.close()
    end_rss = get_rss()
    print 'File size is: %d' % os.path.getsize(filename)
    print 'Data size is: %d' % getsizeof_r(l)
    print 'RSS delta is: %d' % (end_rss - start_rss)

if __name__ == '__main__':
    main()
...which prints...
File size is: 1060864
Data size is: 4313088
RSS delta is: 4722688
...so I'm only getting a four-fold increase, which would be accounted for by the fact that each Unicode char takes up four bytes of RAM.
Perhaps you could test your input file with this script, since I can't explain why you get an eight-fold increase with your script.
Answer 3:
Have you considered the memory usage for the keys? If you have lots of small values in your dictionary, the storage for the keys could dominate.
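As a rough illustration (not from the original answer; figures vary by Python build, and the sample terms are made up), the fixed per-object overhead of short keys and small values can be inspected directly:

import sys

scores = {'sad': -2, 'happy': 3, 'terrible': -3}   # hypothetical sample entries

key_bytes = sum(sys.getsizeof(k) for k in scores)
value_bytes = sum(sys.getsizeof(v) for v in scores.itervalues())
print 'keys  : %d bytes' % key_bytes              # ~40 bytes per short str on 64-bit CPython 2
print 'values: %d bytes' % value_bytes            # ~24 bytes per small int on 64-bit CPython 2
print 'table : %d bytes' % sys.getsizeof(scores)  # the hash table itself, excluding keys/values

A term that is only a few bytes on disk can therefore cost tens of bytes as a key object, plus its slot in the hash table, before the value is even counted.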
Source: https://stackoverflow.com/questions/17313381/why-does-a-dictionary-use-so-much-ram-in-python