Question
I have a python script which is processing a large amount of data from compressed ASCII. After a short period, it runs out of memory. I am not constructing large lists or dicts. The following code illustrates the issue:
import struct
import zlib
import binascii
import numpy as np
import psutil
import os
import gc
process = psutil.Process(os.getpid())
n = 1000000
compressed_data = binascii.b2a_base64(bytearray(zlib.compress(struct.pack('%dB' % n, *np.random.random(n))))).rstrip()
print 'Memory before entering the loop is %d MB' % (process.get_memory_info()[0] / float(2 ** 20))
for i in xrange(2):
    print 'Memory before iteration %d is %d MB' % (i, process.get_memory_info()[0] / float(2 ** 20))
    byte_array = zlib.decompress(binascii.a2b_base64(compressed_data))
    a = np.array(struct.unpack('%dB' % (len(byte_array)), byte_array))
gc.collect()
gc.collect()
print 'Memory after last iteration is %d MB' % (process.get_memory_info()[0] / float(2 ** 20))
It prints:
Memory before entering the loop is 45 MB
Memory before iteration 0 is 45 MB
Memory before iteration 1 is 51 MB
Memory after last iteration is 51 MB
Between the first and second iteration, 6 MB of memory is allocated. If I run the loop more than twice, the memory usage stays at 51 MB. But if I put the decompression code into its own function and feed it the actual compressed data, the memory usage continues to grow. I am using Python 2.7. Why is the memory increasing, and how can it be corrected? Thank you.
Answer 1:
Through comments, we figured out what was going on:
The main issue is that names assigned in a for loop are not destroyed once the loop ends. They remain accessible, still bound to the value they received in the last iteration:
>>> for i in range(5):
... a=i
...
>>> print a
4
So here's what's happening:
- First iteration: the print shows 45 MB, which is the memory before instantiating byte_array and a. The code then creates those two large objects, pushing the memory to 51 MB.
- Second iteration: the two variables instantiated in the first pass of the loop are still alive. Part-way through the second iteration, byte_array and a are rebound to the new objects. The old ones are destroyed, but replaced by equally large ones, so usage stays at 51 MB.
- The for loop ends, but byte_array and a are still accessible in the code, and are therefore not freed by the gc.collect() calls.
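The persistence described above is easy to reproduce. Here is a minimal Python 3 sketch (loop_retention_demo is a name invented for this example): the names bound inside the loop survive it, and only dropping the last reference makes the object collectable.

```python
def loop_retention_demo():
    # Rebind the same names on every pass, as the question's loop does.
    for i in range(3):
        buf = bytearray(10 ** 6)  # ~1 MB, rebound each iteration
    # The loop has ended, but both names are still bound to the objects
    # from the last iteration, keeping that memory alive.
    alive = (i, len(buf))
    del buf  # drop the last reference; the bytearray can now be collected
    return alive

print(loop_retention_demo())
```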
Changing the code to:
for i in xrange(2):
    [ . . . ]
byte_array = None
a = None
gc.collect()
makes the memory reserved by byte_array and a unreachable, and therefore eligible to be freed.
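For reference, a Python 3 adaptation of the whole loop with that fix applied might look like the sketch below. It is not the asker's exact code: struct 'B' requires integers in Python 3, so the test payload is built with ndarray.tobytes() instead of struct.pack, and np.frombuffer replaces struct.unpack.

```python
import binascii
import gc
import zlib

import numpy as np

n = 1_000_000
# Build some compressed base64 test data from random bytes.
payload = np.random.randint(0, 256, n, dtype=np.uint8).tobytes()
compressed_data = binascii.b2a_base64(zlib.compress(payload)).rstrip()

for i in range(2):
    byte_array = zlib.decompress(binascii.a2b_base64(compressed_data))
    a = np.frombuffer(byte_array, dtype=np.uint8)

# Drop the last references so the buffers become unreachable,
# then let the garbage collector reclaim them.
byte_array = None
a = None
gc.collect()
```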
There's more on Python's garbage collection in this SO answer: https://stackoverflow.com/a/4484312/289011
Also, it may be worth looking at How do I determine the size of an object in Python?. This is tricky, though... if your object is a list pointing to other objects, what is the size? The sum of the pointers in the list? The sum of the size of the objects those pointers point to?
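To make that ambiguity concrete, here is a small Python 3 sketch: sys.getsizeof reports only the shallow size of a container (essentially its header plus pointers), so a "deep" figure has to be computed by also walking the referenced objects (shallow and deep are names invented for this example).

```python
import sys

# A list of ten 1000-byte strings: the list itself stores only pointers.
lst = [b'x' * 1000 for _ in range(10)]

shallow = sys.getsizeof(lst)  # size of the list object alone
deep = shallow + sum(sys.getsizeof(o) for o in lst)  # plus the elements

print(shallow, deep)
```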
Source: https://stackoverflow.com/questions/27236967/python-memory-leak-using-binascii-zlib-struct-and-numpy