Question
Following on from this question, I've implemented two functions that do the same thing; one uses re-indexing and the other does not. The functions differ in the 3rd line:
def update(centroid):
    best_mean_dist = 200
    clust_members = members_by_centeriod[centroid]
    for member in clust_members:
        member_mean_dist = 100 - df.ix[member].ix[clust_members].score.mean()
        if member_mean_dist < best_mean_dist:
            best_mean_dist = member_mean_dist
            centroid = member
    return centroid, best_mean_dist
def update1(centroid):
    best_mean_dist = 200
    members_in_clust = members_by_centeriod[centroid]
    new_df = df.reindex(members_in_clust, level=0).reindex(members_in_clust, level=1)
    for member in members_in_clust:
        member_mean_dist = 100 - new_df.ix[member].ix[members_in_clust].score.mean()
        if member_mean_dist < best_mean_dist:
            best_mean_dist = member_mean_dist
            centroid = member
    return centroid, best_mean_dist
The functions are being called from an IPython notebook cell:
for centroid in centroids:
    centroid = [update(centroid) for centroid in centroids]
The dataframe df is large, with around 4 million rows, and takes ~300MB in memory.
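The question doesn't show how df is built; judging from the indexing in update, it appears to hold one score per pair of members under a two-level MultiIndex. A minimal sketch of such a frame, with all values and the cluster mapping assumed for illustration:

import numpy as np
import pandas as pd

# Hypothetical reconstruction: one row per (member, member) pair,
# with a similarity score in [0, 100].
iy, ix = np.indices((200, 200))
index = pd.MultiIndex.from_arrays([iy.ravel(), ix.ravel()])
df = pd.DataFrame({'score': np.random.uniform(0, 100, len(index))},
                  index=index)

# Hypothetical cluster membership, mapping a centroid to its members.
members_by_centeriod = {0: [1, 7, 42, 100]}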
The update1 function that uses re-indexing is much faster, but something unexpected happens: after just a few iterations, memory usage quickly climbs from ~300MB to 1.5GB and then I get a memory violation.
The update function does not suffer from this kind of behavior. There are a few things I'm not getting:

- Re-indexing makes a copy, that is obvious, but isn't that copy supposed to die each time the update1 function finishes? The new_df variable should die with the function that created it, right?
- Even if the garbage collector is not killing new_df right away, once memory runs out it should collect it rather than raise an out-of-memory error, right?
- I tried killing new_df manually by adding del new_df at the end of the update1 function; that didn't help. So might that indicate that the bug is actually in the re-indexing process itself? (See the diagnostic sketch after this list.)
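One way to investigate is to force a collection and then ask gc which DataFrames it still tracks; a rough stdlib-only sketch, not from the original post, and the expectation in the comment is an assumption:

import gc

def live_dataframes():
    # Force a full collection, then report every DataFrame the
    # collector still tracks and how many objects refer to it.
    gc.collect()
    for obj in gc.get_objects():
        if type(obj).__name__ == 'DataFrame':
            print 'rows:', len(obj), 'referrers:', len(gc.get_referrers(obj))

update1(centroid)
live_dataframes()   # if the copies were freed, only df should remain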
EDIT:
I found the problem, but I can't understand the reason for this behavior: it is the Python garbage collector, refusing to clean up the re-indexed dataframe. This does not leak:
for i in range(2000):
    new_df = df.reindex(clust_members, level=0).reindex(clust_members, level=1)
Neither does this:
def reindex():
    new_df = df.reindex(clust_members, level=0).reindex(clust_members, level=1)
    score = 100 - new_df.ix[member].ix[clust_members].score.mean()
    return score

for i in range(2000):
    reindex()
But this preserves the re-indexed objects in memory:
z = []
for i in range(2000):
    z.append(reindex())
I think my usage is naively correct. How does the new_df variable stay connected to the score value, and why?
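One way to probe whether each new_df itself survives is to hold weak references to the copies; a weakref dies as soon as its target is collected. A sketch, not from the original post, using the same free variables as the snippets above:

import gc
import weakref

copies = []

def reindex_tracked():
    new_df = df.reindex(clust_members, level=0).reindex(clust_members, level=1)
    copies.append(weakref.ref(new_df))   # a weakref does not keep new_df alive
    return 100 - new_df.ix[member].ix[clust_members].score.mean()

z = [reindex_tracked() for i in range(2000)]
gc.collect()
print sum(1 for r in copies if r() is not None), 'copies still alive'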
Answer 1:
Here is my debug code. When you do indexing, the Index object creates a _tuples list and an engine map; I think the memory is being used by these two cache objects. If I add the lines marked with ****, then the memory increase is very small, about 6MB on my PC:
import pandas as pd
print pd.__version__
import numpy as np
import psutil
import os
import gc

def get_memory():
    # Resident set size of the current process, in bytes.
    pid = os.getpid()
    p = psutil.Process(pid)
    return p.get_memory_info().rss

def get_object_ids():
    return set(id(obj) for obj in gc.get_objects())

m1 = get_memory()

# Build a test frame with a two-level MultiIndex, similar to the
# questioner's df.
n = 2000
iy, ix = np.indices((n, n))
index = pd.MultiIndex.from_arrays([iy.ravel(), ix.ravel()])
values = np.random.rand(n*n, 3)
df = pd.DataFrame(values, index=index, columns=["a", "b", "c"])

ix = np.unique(np.random.randint(0, n, 500))
iy = np.unique(np.random.randint(0, n, 500))

m2 = get_memory()
objs1 = get_object_ids()

z = []
for i in range(5):
    df2 = df.reindex(ix, level=0).reindex(iy, level=1)
    z.append(df2.mean().mean())
    df.index._tuples = None   # ****
    df.index._cleanup()       # ****
    del df2
    gc.collect()              # ****
m3 = get_memory()

print (m2-m1)/1e6, (m3-m2)/1e6

# Count the types of objects created since objs1 that still survive
# collection.
from collections import Counter
counter = Counter()
for obj in gc.get_objects():
    if id(obj) not in objs1:
        typename = type(obj).__name__
        counter[typename] += 1
print counter
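Based on this finding, one way to work the fix into the questioner's loop is to clear those caches after every update1 call. A sketch that relies on the same private pandas internals as the debug code above, so it may break in other pandas versions:

def update1_clean(centroid):
    result = update1(centroid)
    # Drop the tuple cache and hash-engine maps that the MultiIndex
    # built while re-indexing; these are private pandas internals.
    df.index._tuples = None
    df.index._cleanup()
    gc.collect()
    return result

centroids = [update1_clean(c) for c in centroids]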
Source: https://stackoverflow.com/questions/21255234/dataframe-re-indexing-object-unnecessarily-preserved-in-memory