Memory efficient (constant) and speed optimized iteration over a large table in Django

感动是毒 · 2021-01-30 05:20

I have a very large table. It's currently in a MySQL database. I use Django.

I need to iterate over each element of the table to pre-compute some parti…

3 Answers
  •  刺人心 (OP) · 2021-01-30 05:43

    The essential answer: use raw SQL with server-side cursors.

    Sadly, until Django 1.5.2 there is no formal way to create a server-side MySQL cursor (not sure about other database engines). So I wrote some magic code to solve this problem.

    For Django 1.5.2 and MySQLdb 1.2.4, the following code will work. Also, it's well commented.

    Caution: This is not based on public APIs, so it will probably break in future Django versions.

    # This script should be tested under a Django shell, e.g., ./manage.py shell
    
    from types import MethodType
    
    import MySQLdb.cursors
    import MySQLdb.connections
    from django.db import connection
    from django.db.backends.util import CursorDebugWrapper
    
    
    def close_sscursor(self):
    """An instance method which replaces the close() method of the original cursor.
    
        Closing the server-side cursor with the original close() method will be
        quite slow and memory-intensive if the large result set was not exhausted,
        because fetchall() will be called internally to get the remaining records.
        Notice that the close() method is also called when the cursor is garbage 
        collected.
    
    This method is more efficient at closing the cursor, but if the result set
        is not fully iterated, the next cursor created from the same connection
        won't work properly. You can avoid this by either (1) closing the
        connection before creating a new cursor, or (2) fully iterating the
        result set before closing the server-side cursor.
        """
        if isinstance(self, CursorDebugWrapper):
            self.cursor.cursor.connection = None
        else:
            # This is for CursorWrapper object
            self.cursor.connection = None
    
    
    def get_sscursor(connection, cursorclass=MySQLdb.cursors.SSCursor):
        """Get a server-side MySQL cursor."""
        if connection.settings_dict['ENGINE'] != 'django.db.backends.mysql':
            raise NotImplementedError('Only MySQL engine is supported')
        cursor = connection.cursor()
        if isinstance(cursor, CursorDebugWrapper):
            # Get the real MySQLdb.connections.Connection object
            conn = cursor.cursor.cursor.connection
            # Replace the internal client-side cursor with a server-side cursor
            cursor.cursor.cursor = conn.cursor(cursorclass=cursorclass)
        else:
            # This is for CursorWrapper object
            conn = cursor.cursor.connection
            cursor.cursor = conn.cursor(cursorclass=cursorclass)
        # Replace the old close() method
        cursor.close = MethodType(close_sscursor, cursor)
        return cursor
    
    
    # Get the server-side cursor
    cursor = get_sscursor(connection)
    
    # Run a query with a large result set. Notice that the memory consumption is low.
    cursor.execute('SELECT * FROM million_record_table')
    
    # Fetch a single row, fetchmany() rows or iterate it via "for row in cursor:"
    cursor.fetchone()
    
    # You can interrupt the iteration at any time. This calls the new close() method,
    # so no warning is shown.
    cursor.close()
    
    # The connection must be closed to let new cursors work properly. See the
    # docstring of close_sscursor().
    connection.close()
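
    In practice the server-side cursor is consumed with fetchmany() in a loop. As a minimal sketch of that constant-memory pattern (the names iter_rows and FakeCursor are mine, not part of Django or MySQLdb; any DB-API cursor, including the SSCursor above, exposes the same fetchmany() interface):

```python
def iter_rows(cursor, batch_size=1000):
    """Yield rows one by one, fetching them from the cursor in batches.

    Only batch_size rows are held in Python memory at a time, which is
    what makes iteration over a million-row table affordable.
    """
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        for row in rows:
            yield row


# Stand-in cursor so the sketch runs without a database; a real
# MySQLdb SSCursor provides the same fetchmany() method.
class FakeCursor:
    def __init__(self, rows):
        self._rows = list(rows)

    def fetchmany(self, size):
        batch, self._rows = self._rows[:size], self._rows[size:]
        return batch


total = sum(row for row in iter_rows(FakeCursor(range(10)), batch_size=3))
print(total)  # 45
```

    With the server-side cursor from the answer, the FakeCursor would simply be replaced by the cursor returned from get_sscursor(connection) after execute().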
    
