psycopg2 leaking memory after large query

野性不改 2020-12-24 07:55

I'm running a large query in a Python script against my Postgres database using psycopg2 (I upgraded to version 2.5). After the query is finished, I close the cursor and connection, but the process still holds on to a large amount of memory.

3 answers
  • 2020-12-24 08:03

    I ran into a similar problem and, after a couple of hours of blood, sweat and tears, found that the answer simply requires adding one parameter.

    Instead of

    cursor = conn.cursor()
    

    write

    cursor = conn.cursor(name="my_cursor_name")
    

    or simpler yet

    cursor = conn.cursor("my_cursor_name")
    

    The details are found at http://initd.org/psycopg/docs/usage.html#server-side-cursors

    I found the instructions a little confusing in that I thought I'd need to rewrite my SQL to include "DECLARE my_cursor_name ..." and then a "FETCH count 2000 FROM my_cursor_name", but it turns out psycopg does all of that for you under the hood if you simply override the "name=None" default parameter when creating a cursor.

    The suggestion below of using fetchone or fetchmany doesn't resolve the problem on its own since, if you leave the name parameter unset, psycopg will by default attempt to load the entire result set into RAM. The only other thing you may need to do (besides setting a name parameter) is to change the cursor.itersize attribute from its default of 2000 down to, say, 1000 if you still have too little memory.
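
    Putting this together, here is a minimal sketch of a named (server-side) cursor; the connection string and table name are placeholders, not from the question:

    import psycopg2

    conn = psycopg2.connect("dbname=mydb")    # placeholder DSN
    cur = conn.cursor(name="my_cursor_name")  # named => server-side cursor
    cur.itersize = 1000    # rows fetched per round trip (default is 2000)

    cur.execute("SELECT * FROM big_table")    # placeholder table
    for row in cur:        # psycopg issues DECLARE/FETCH behind the scenes
        pass               # process one row at a time here

    cur.close()
    conn.close()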

  • 2020-12-24 08:22

    Please see the next answer by @joeblog for the better solution.


    First, you shouldn't need all that RAM in the first place. What you should be doing here is fetching chunks of the result set. Don't do a fetchall(). Instead, use the much more efficient cursor.fetchmany method. See the psycopg2 documentation.
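
    As an illustration, a sketch of a fetchmany loop (the connection string, cursor name, and query are made up for the example). Note the caveat from the first answer: with an unnamed (client-side) cursor, libpq still transfers the whole result set on execute(), so pair fetchmany with a named cursor to actually cap memory:

    import psycopg2

    conn = psycopg2.connect("dbname=mydb")    # placeholder DSN
    cur = conn.cursor("chunked")              # named cursor, see the other answer
    cur.execute("SELECT id, payload FROM big_table")  # placeholder query

    n = 0
    while True:
        rows = cur.fetchmany(2000)            # at most 2000 rows per call
        if not rows:
            break
        n += len(rows)                        # do your per-row work here

    cur.close()
    conn.close()
    print(n, "rows processed")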

    Now, the explanation of why the memory isn't freed, and why that isn't a memory leak in the formally correct sense of the term.

    Most processes don't release memory back to the OS when it's freed; they just make it available for re-use elsewhere within the program.

    Memory may only be released to the OS if the program can compact the remaining objects scattered through memory. This is only possible if indirect handle references are used, since otherwise moving an object would invalidate existing pointers to the object. Indirect references are rather inefficient, especially on modern CPUs where chasing pointers around does horrible things to performance.

    What usually ends up happening, unless the program exercises extra caution, is that each large chunk of memory allocated with brk() ends up with a few small pieces still in use.

    The OS can't tell whether the program considers this memory still in use or not, so it can't just claim it back. Since the program doesn't tend to access the memory the OS will usually swap it out over time, freeing physical memory for other uses. This is one of the reasons you should have swap space.

    It's possible to write programs that hand memory back to the OS, but I'm not sure that you can do it with Python.
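
    To make the effect visible, here is a small, Linux-only sketch (reading /proc/self/statm is my addition, not something from the answer) that prints the resident set size before and after freeing a big list; depending on heap fragmentation, the RSS often stays well above its starting value:

    import gc
    import os

    def rss_mb():
        # Current resident set size; field 2 of /proc/self/statm is
        # the count of resident pages (Linux-specific).
        with open("/proc/self/statm") as f:
            pages = int(f.read().split()[1])
        return pages * os.sysconf("SC_PAGE_SIZE") / 1e6

    print(f"start:     {rss_mb():6.0f} MB")
    big = [str(i) for i in range(5_000_000)]  # many small heap objects
    print(f"allocated: {rss_mb():6.0f} MB")
    del big
    gc.collect()
    # The freed memory is reusable inside this process, but fragmentation
    # may keep the allocator from handing it back to the OS.
    print(f"freed:     {rss_mb():6.0f} MB")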

    See also:

    • python - memory not being given back to kernel
    • Why doesn't memory get released to system after large queries (or series of queries) in django?
    • Releasing memory in Python

    So: this isn't actually a memory leak. If you do something else that uses lots of memory, the process shouldn't grow much, if at all; it will re-use the previously freed memory from the last big allocation.

  • 2020-12-24 08:24

    Joeblog has the correct answer. The way you handle the fetching matters, but it is far more obvious than the way you must define the cursor. Here is a simple example to illustrate both and give you something to copy-paste to start with.

    import psycopg2
    import sys

    conPG = psycopg2.connect("dbname='myDearDB'")
    curPG = conPG.cursor('testCursor')  # named cursor => server-side
    curPG.itersize = 100000  # rows fetched from the server at one time

    curPG.execute("SELECT * FROM myBigTable LIMIT 10000000")
    # Warning: curPG.rowcount == -1 ALWAYS for a named cursor!
    cptLigne = 0
    for rec in curPG:
        cptLigne += 1
        if cptLigne % 10000 == 0:
            print('.', end='')
            sys.stdout.flush()  # to see the progression
    conPG.commit()  # also closes the named cursor
    conPG.close()


    As you will see, the dots arrive in rapid groups, then pause while the next buffer of rows (itersize) is fetched from the server, so you don't need fetchmany for performance. When I run this with /usr/bin/time -v, it finishes in under 3 minutes, using only 200MB of RAM (instead of 60GB with a client-side cursor) for 10 million rows. The server doesn't need more RAM, as it works from a temporary table.
