Question
In continuation from this question, I've implemented two functions that do the same thing; one uses re-indexing and the other does not. The functions differ in their third line:
def update(centroid):
    best_mean_dist = 200
    clust_members = members_by_centeriod[centroid]
    for member in clust_members:
        member_mean_dist = 100 - df.ix[member].ix[clust_members].score.mean()
        if member_mean_dist < best_mean_dist:
            best_mean_dist = member_mean_dist
            centroid = member
    return centroid, best_mean_dist
def update1(centroid):
    best_mean_dist = 200
    members_in_clust = members_by_centeriod[centroid]
    new_df = df.reindex(members_in_clust, level=0).reindex(members_in_clust, level=1)
    for member in members_in_clust:
        member_mean_dist = 100 - new_df.ix[member].ix[members_in_clust].score.mean()
        if member_mean_dist < best_mean_dist:
            best_mean_dist = member_mean_dist
            centroid = member
    return centroid, best_mean_dist
The functions are being called from an IPython notebook cell:
for centroid in centroids:
    centroid = [update(centroid) for centroid in centroids]
The dataframe df is large, with around 4 million rows, and takes ~300 MB in memory.
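For context, here is a miniature stand-in for the real data, just so the functions above can be run end to end. The sizes and cluster assignments are made up; only the names df, score, members_by_centeriod and centroids come from the original code:

import numpy as np
import pandas as pd

# hypothetical miniature version of the real data: a (member, member)
# MultiIndex frame holding pairwise scores in a "score" column
n = 20
iy, ix = np.indices((n, n))
index = pd.MultiIndex.from_arrays([iy.ravel(), ix.ravel()])
df = pd.DataFrame({"score": np.random.rand(n * n) * 100}, index=index)

# hypothetical clustering: each centroid maps to the members of its cluster
members_by_centeriod = {0: [0, 1, 2, 3], 5: [5, 6, 7, 8, 9]}
centroids = list(members_by_centeriod)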
The update1 function that uses re-indexing is much faster, but something unexpected happens: after just a few iterations of the re-indexing version, memory usage quickly climbs from ~300 MB to 1.5 GB, and then I get a memory violation.
The update function does not suffer from this kind of behavior. Two things I'm not getting:

1. Re-indexing makes a copy, that is obvious. But isn't that copy supposed to die each time the update1 function finishes? The new_df variable should die with the function that created it, right?
2. Even if the garbage collector doesn't kill new_df right away, once memory runs out it should reclaim it rather than raise an out-of-memory error, right?

I tried killing the frame manually by adding del new_df at the end of the update1 function; that didn't help. So might that indicate that the bug is actually in the re-indexing process itself?
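Concretely, the attempted cleanup amounted to roughly this (a sketch; the post only states that del new_df was added at the end of update1):

def update1(centroid):
    best_mean_dist = 200
    members_in_clust = members_by_centeriod[centroid]
    new_df = df.reindex(members_in_clust, level=0).reindex(members_in_clust, level=1)
    for member in members_in_clust:
        member_mean_dist = 100 - new_df.ix[member].ix[members_in_clust].score.mean()
        if member_mean_dist < best_mean_dist:
            best_mean_dist = member_mean_dist
            centroid = member
    del new_df  # explicit delete before returning; did not stop the memory growth
    return centroid, best_mean_dist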
EDIT:
I found the problem, but I can't understand the reason for this behavior: it is the Python garbage collector refusing to clean up the re-indexed dataframe. This is valid:
for i in range(2000):
    new_df = df.reindex(clust_members, level=0).reindex(clust_members, level=1)
This is valid, too:
def reindex():
    new_df = df.reindex(clust_members, level=0).reindex(clust_members, level=1)
    score = 100 - new_df.ix[member].ix[clust_members].score.mean()
    return score

for i in range(2000):
    reindex()
But this causes the re-indexed object to be preserved in memory:
z = []
for i in range(2000):
    z.append(reindex())
I think my usage is naively correct. How does the new_df variable stay connected to the score value, and why?
Answer 1:
Here is my debug code. When you do indexing, the Index object creates a _tuples cache and an engine map; I think the memory is used by these two cached objects. If I add the lines marked with ****, the memory increase is very small, about 6 MB on my PC:
import pandas as pd
print pd.__version__
import numpy as np
import psutil
import os
import gc

def get_memory():
    # resident memory of the current process (older psutil API;
    # newer psutil versions call this p.memory_info())
    pid = os.getpid()
    p = psutil.Process(pid)
    return p.get_memory_info().rss

def get_object_ids():
    # ids of every object currently tracked by the garbage collector
    return set(id(obj) for obj in gc.get_objects())

m1 = get_memory()

n = 2000
iy, ix = np.indices((n, n))
index = pd.MultiIndex.from_arrays([iy.ravel(), ix.ravel()])
values = np.random.rand(n*n, 3)
df = pd.DataFrame(values, index=index, columns=["a", "b", "c"])

ix = np.unique(np.random.randint(0, n, 500))
iy = np.unique(np.random.randint(0, n, 500))

m2 = get_memory()
objs1 = get_object_ids()

z = []
for i in range(5):
    df2 = df.reindex(ix, level=0).reindex(iy, level=1)
    z.append(df2.mean().mean())
    df.index._tuples = None  # ****
    df.index._cleanup()      # ****
    del df2
    gc.collect()             # ****

m3 = get_memory()
print (m2-m1)/1e6, (m3-m2)/1e6

from collections import Counter
counter = Counter()
for obj in gc.get_objects():
    if id(obj) not in objs1:
        typename = type(obj).__name__
        counter[typename] += 1
print counter
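Back in the question's setting, the same cache-clearing would presumably go after each update1 call. A sketch of that, reusing the internal _tuples / _cleanup() calls from the debug code above (these are pandas internals and may differ in other versions):

import gc

for centroid in centroids:
    new_centroid, best_mean_dist = update1(centroid)
    # drop the tuple cache and the index engine that df's MultiIndex
    # built while re-indexing, then force a collection (internal pandas API)
    df.index._tuples = None
    df.index._cleanup()
    gc.collect()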
Source: https://stackoverflow.com/questions/21255234/dataframe-re-indexing-object-unnecessarily-preserved-in-memory