Force Python to release objects to free up memory


Question


I am running the following code:

import glob
from myUtilities import myObject
for year in range(2006,2015):
    front = 'D:\\newFilings\\'
    back = '\\*\\dirTYPE\\*.sgml'
    path = front + str(year) + back
    sgmlFilings = glob.glob(path)
    for each in sgmlFilings:
        header = myObject(each)
        try:
            tagged = header.process_tagged('G:')
        except Exception as e:
            outref = open('D:\\ProblemFiles.txt','a')
            outref.write(each + '\n')
            outref.close()
            print each

If I start from a reboot, the memory consumption of Python is fairly small. Over time, though, it increases significantly, and ultimately after about a day I have very little free memory (24 GB installed; 294 MB free, 23,960 MB cached), and the memory claimed by Python in the Windows Task Manager is 3 GB. I am watching this increase over the three days it takes to run the code against the file collection.

I was under the impression that since I am doing everything with

tagged = header.process_tagged('G:')

the memory associated with each loop iteration would be freed and garbage collected.

Is there something I can do to force the release of this memory? While I have not collected statistics yet, I can tell by watching disk activity that the process slows down as time goes on (and the memory footprint gets bigger).

EDIT

I looked at the question referenced below and I do not think it is the same issue: as I understand it, in the other question they are holding onto the objects (a list of triangles) and need the entire list for computation. With each loop I am reading a file, performing some processing on it, and then writing it back out to disk. Then I read the next file . . .

With regard to possible memory leaks: I am using lxml in myObject.

Note: I added the line from myUtilities import myObject since the first iteration of this question. myUtilities holds the code that does everything.

Regarding posting my code for myUtilities: that gets away from the basic question. I am done with header and tagged after each iteration; tagged does its work and writes the results to another drive (as a matter of fact, a newly formatted drive).

I looked into using multiprocessing, but I didn't pursue it because of a vague idea that, since this is so I/O intensive, I would be competing for the drive heads. Maybe that is wrong, but since each iteration requires writing a couple of hundred MB of files, I would think I would be competing for write and even read time.

UPDATE: I had one case in the myObject class where a file was opened with

myString = open(somefile).read()

I changed that to

with open(somefile, 'r') as fHandle:
    myString = fHandle.read()

(sorry for the formatting - still struggling)

However, this had no apparent effect. When I started a new cycle I had 4000 MB of cached memory; after 22 minutes and the processing of 27K files I had roughly 26000 MB of cached memory.

I appreciate all of the answers and comments below and have been reading up and testing various things all day. I will keep updating this; I thought this task would take a week, and now it looks like it might take over a month.

I keep getting questions about the rest of the code. However, it is over 800 lines, and to me that sort of gets away from the central question.

So an instance of myObject is created (header), and then we apply the methods contained in myObject to header.

This is basically file transformation: a file is read in, and copies of parts of the file are made and written to disk.
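In rough outline, the work looks something like the following. This is a hypothetical skeleton only, not the real myUtilities module (which is over 800 lines and not posted here); the attribute names and selection logic are placeholders:

class myObject(object):
    # Hypothetical skeleton; attribute names and the section-splitting logic
    # are placeholders, not the actual code.
    def __init__(self, path):
        with open(path) as fHandle:
            self.text = fHandle.read()          # read one SGML filing into memory

    def process_tagged(self, out_drive):
        # carve out the parts of interest and write each one to the destination drive
        for i, section in enumerate(self.text.split('\n\n')):
            out_path = '%s\\section_%d.txt' % (out_drive, i)
            with open(out_path, 'w') as out:
                out.write(section)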

The central question to me is that there is obviously some persistence of either header or tagged. How can I dispose of everything related to header or tagged before I start the next cycle?

I have been running the code for the last 14 hours or so. On the first cycle it took about 22 minutes to process 27K files; now it is taking an hour and a half to handle approximately the same number.

Just running gc.collect() does not work. I stopped the program and tried it in the interpreter, and I saw no movement in the memory statistics.

EDIT: after reading the memory allocator description linked below, I am thinking that the amount tied up in the cache is not the problem; it is the amount tied up by the running Python process. So the new test is to run the code from the command line. I am continuing to watch and monitor and will post more once I see what happens.

EDIT: still struggling, but I have set up the code to run from a bat file using the data from one loop of sgmlFilings (see above). The batch file looks like this:

python batch.py
python batch.py
 .
 .
 .

batch.py starts by reading a queue file that holds a list of directories to glob; it takes the first one off the list, updates and saves the list, and then runs the header and tagged processing. It is clumsy, but since python.exe exits after each iteration, Python never accumulates memory, and so the process runs at a consistent speed.
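For reference, a minimal sketch of what such a batch.py could look like, assuming a plain-text queue file with one glob pattern per line (the queue file name and layout are assumptions, not the actual script):

# batch.py - hypothetical sketch of the one-pattern-per-process driver
# described above; the queue file name and layout are assumptions.
import glob

from myUtilities import myObject

QUEUE_FILE = 'D:\\queue.txt'   # one glob pattern per line

def main():
    with open(QUEUE_FILE) as fh:
        patterns = [line.strip() for line in fh if line.strip()]
    if not patterns:
        return                                    # queue exhausted, nothing to do
    pattern, remaining = patterns[0], patterns[1:]

    # rewrite the queue first so the next invocation picks up the next pattern
    with open(QUEUE_FILE, 'w') as fh:
        fh.write('\n'.join(remaining))

    for each in glob.glob(pattern):
        header = myObject(each)
        try:
            header.process_tagged('G:')
        except Exception:
            with open('D:\\ProblemFiles.txt', 'a') as outref:
                outref.write(each + '\n')

if __name__ == '__main__':
    main()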


Answer 1:


The reason is CPython's memory management. The way Python manages memory makes things hard for long-running programs: when you explicitly free an object with the del statement, CPython does not necessarily return the allocated memory to the OS; it keeps the memory around for future use. One way to work around this problem is to use the multiprocessing module and kill the process after you are done with the job, then create another one. This way you free the memory by force, and the OS must reclaim the memory used by that child process.

I have had the exact same problem: memory usage increased excessively over time, to the point where the system became unstable and unresponsive. I used a different technique, with signals and psutil, to work around it. This problem commonly occurs when you have a loop that needs to allocate and deallocate a lot of data, for example.
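A minimal sketch of that multiprocessing approach, applied to the loop from the question (the process_one wrapper is illustrative): each file is handled in a short-lived child process, and when the child exits the OS reclaims all of its memory.

import glob
import multiprocessing

from myUtilities import myObject   # the asker's module, as in the question

def process_one(path):
    # all allocations made here live in the child process only
    header = myObject(path)
    header.process_tagged('G:')

if __name__ == '__main__':
    for each in glob.glob('D:\\newFilings\\2006\\*\\dirTYPE\\*.sgml'):
        p = multiprocessing.Process(target=process_one, args=(each,))
        p.start()
        p.join()   # when the child exits, its memory goes back to the OS

Spawning one process per file adds startup overhead; batching the work, for example one process per year or per directory, is a common compromise.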

You can read more about Python's memory allocator here: http://www.evanjones.ca/memoryallocator/

This tool is also very helpful for profiling memory usage: https://pypi.python.org/pypi/memory_profiler
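For example, memory_profiler can report line-by-line memory usage of a decorated function when the script is run with python -m memory_profiler; the function below is just a placeholder:

# pip install memory_profiler, then: python -m memory_profiler this_script.py
from memory_profiler import profile

@profile
def process_one(path):
    # placeholder for the real per-file work
    with open(path) as fh:
        data = fh.read()
    return len(data)

if __name__ == '__main__':
    process_one('D:\\ProblemFiles.txt')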

One more thing: add __slots__ to myObject. It seems you have a fixed set of attributes in your object, so this also helps reduce RAM usage; objects without __slots__ specified allocate more RAM to take care of dynamic attributes you may add to them later: http://tech.oyster.com/save-ram-with-python-slots/
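A minimal illustration of __slots__ (the attribute names are hypothetical, since myObject's real attributes were not posted): with __slots__ defined, instances carry no per-instance __dict__, which saves memory when many objects are created.

class myObject(object):
    # Hypothetical attribute names; the point is that declaring __slots__
    # removes the per-instance __dict__ and fixes which attributes can exist.
    __slots__ = ('path', 'header', 'tagged')

    def __init__(self, path):
        self.path = path
        self.header = None
        self.tagged = None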




Answer 2:


You can force a garbage collection using the gc module, in particular the gc.collect() function.

However, this may not solve your problem, since the gc is probably running but you are either using a library/code that contains a memory leak, or the library/code is keeping some references around somewhere. In any case, I doubt that the gc is the issue here.

Sometimes you may have some code that is keeping alive references to objects that you no longer need. In such a case you can consider explicitly deleting them when they aren't needed anymore; however, this doesn't seem to be the case here.


Also keep in mind that the memory usage of the Python process may actually be a lot smaller than reported by the OS. In particular, calls to free() need not return the memory to the OS (usually this doesn't happen when performing small allocations), so what you see may be the highest peak of memory usage up to that point, not the current usage. Add to this the fact that Python uses another layer of memory allocation on top of C's, and this makes it pretty hard to profile memory usage. However, if the memory keeps going up and up, this probably isn't what is happening here.

You should use something like Guppy to profile memory usage.
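For example, guppy's heap snapshot shows which object types are holding memory at a given point (a sketch; guppy targets Python 2, and guppy3 is its Python 3 port):

from guppy import hpy

hp = hpy()
# ... run one loop iteration here ...
print(hp.heap())   # breakdown of live objects by type and total size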




Answer 3:


You have some measure of control over this stuff using the gc module. Specifically, you might try incorporating

gc.collect() 

in your loop's body.
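For example, placed in the loop from the question (a sketch, with the try/except omitted; dropping the references first gives the collector something to reclaim):

import gc

for each in sgmlFilings:
    header = myObject(each)
    tagged = header.process_tagged('G:')
    del header, tagged    # drop this iteration's references
    gc.collect()          # then ask the collector to reclaim them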




Answer 4:


Before resorting to forced garbage collection (never a good idea), try something basic:

  1. Use glob.iglob (a generator) instead of getting a list of all your files at once.

  2. In your myObject(each) method, make sure you are closing the file or use the with statement so it is automatically closed; otherwise it will remain in memory eating up space.

  3. Don't open and close the log file on every error; just open it once for writing and use it in your exception handler.

As you haven't posted the actual code that's doing the processing (and thus, possibly, the cause of your memory woes), it is difficult to recommend specifics; a sketch of the outer loop with points 1 and 3 applied is below.
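A sketch of the question's outer loop with points 1 and 3 applied: glob.iglob for lazy iteration, and a single, long-lived handle for the problem-file log.

import glob

from myUtilities import myObject

# Points 1 and 3 from above: iterate lazily with glob.iglob and open the
# problem-file log once instead of reopening it on every error.
with open('D:\\ProblemFiles.txt', 'a') as outref:
    for year in range(2006, 2015):
        path = 'D:\\newFilings\\' + str(year) + '\\*\\dirTYPE\\*.sgml'
        for each in glob.iglob(path):
            header = myObject(each)
            try:
                header.process_tagged('G:')
            except Exception:
                outref.write(each + '\n')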



Source: https://stackoverflow.com/questions/31089451/force-python-to-release-objects-to-free-up-memory
