Question
In continuation from this question, I've implemented two functions that do the same thing; one uses re-indexing and the other does not. The functions differ in their third line:
def update(centroid):
    best_mean_dist = 200
    clust_members = members_by_centeriod[centroid]
    for member in clust_members:
        member_mean_dist = 100 - df.ix[member].ix[clust_members].score.mean()
        if member_mean_dist < best_mean_dist:
            best_mean_dist = member_mean_dist
            centroid = member
    return centroid, best_mean_dist
def update1(centroid):
    best_mean_dist = 200
    members_in_clust = members_by_centeriod[centroid]
    new_df = df.reindex(members_in_clust, level=0).reindex(members_in_clust, level=1)
    for member in members_in_clust:
        member_mean_dist = 100 - new_df.ix[member].ix[members_in_clust].score.mean()
        if member_mean_dist < best_mean_dist:
            best_mean_dist = member_mean_dist
            centroid = member
    return centroid, best_mean_dist
The functions are being called from an IPython notebook cell:
for centroid in centroids:
    centroid = [update(centroid) for centroid in centroids]
The dataframe df is large, with around 4 million rows, and takes ~300 MB in memory.
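For context, here is a miniature stand-in for the real data, just so the functions above can be run end to end. The sizes and cluster assignments are made up; only the names df, score, members_by_centeriod and centroids come from the original code:

import numpy as np
import pandas as pd

# hypothetical miniature version of the real data: a (member, member)
# MultiIndex frame holding pairwise scores in a "score" column
n = 20
iy, ix = np.indices((n, n))
index = pd.MultiIndex.from_arrays([iy.ravel(), ix.ravel()])
df = pd.DataFrame({"score": np.random.rand(n * n) * 100}, index=index)

# hypothetical clustering: each centroid maps to the members of its cluster
members_by_centeriod = {0: [0, 1, 2, 3], 5: [5, 6, 7, 8, 9]}
centroids = list(members_by_centeriod)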
The update1 function that uses re-indexing is much faster, but something unexpected happens: after just a few iterations of the re-indexing version, memory usage quickly climbs from ~300 MB to 1.5 GB, and then I get a memory violation.
The update function does not suffer from this kind of behavior. Two things I'm not getting:

1. Re-indexing makes a copy, that is obvious. But isn't that copy supposed to die each time the update1 function finishes? The new_df variable should die with the function that created it, right?
2. Even if the garbage collector doesn't kill new_df right away, once memory runs out it should reclaim it rather than raise an out-of-memory error, right?

I tried killing the frame manually by adding del new_df at the end of the update1 function; that didn't help. So might that indicate that the bug is actually in the re-indexing process itself?
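Concretely, the attempted cleanup amounted to roughly this (a sketch; the post only states that del new_df was added at the end of update1):

def update1(centroid):
    best_mean_dist = 200
    members_in_clust = members_by_centeriod[centroid]
    new_df = df.reindex(members_in_clust, level=0).reindex(members_in_clust, level=1)
    for member in members_in_clust:
        member_mean_dist = 100 - new_df.ix[member].ix[members_in_clust].score.mean()
        if member_mean_dist < best_mean_dist:
            best_mean_dist = member_mean_dist
            centroid = member
    del new_df  # explicit delete before returning; did not stop the memory growth
    return centroid, best_mean_dist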
EDIT:
I found the problem, but I can't understand the reason for this behavior: it is the Python garbage collector refusing to clean up the re-indexed dataframe. This is valid:
for i in range(2000):
    new_df = df.reindex(clust_members, level=0).reindex(clust_members, level=1)
This is valid, too:
def reindex():
    new_df = df.reindex(clust_members, level=0).reindex(clust_members, level=1)
    score = 100 - new_df.ix[member].ix[clust_members].score.mean()
    return score

for i in range(2000):
    reindex()
But this causes the re-indexed object to be preserved in memory:
z = []
for i in range(2000):
    z.append(reindex())
I think my usage is naively correct. How does the new_df variable stay connected to the score value, and why?
Answer 1:
Here is my debug code. When you do indexing, the Index object creates a _tuples cache and an engine map; I think the memory is used by these two cached objects. If I add the lines marked with ****, the memory increase is very small, about 6 MB on my PC:
import pandas as pd
print pd.__version__
import numpy as np
import psutil
import os
import gc

def get_memory():
    # resident memory of the current process (older psutil API;
    # newer psutil versions call this p.memory_info())
    pid = os.getpid()
    p = psutil.Process(pid)
    return p.get_memory_info().rss

def get_object_ids():
    # ids of every object currently tracked by the garbage collector
    return set(id(obj) for obj in gc.get_objects())

m1 = get_memory()

n = 2000
iy, ix = np.indices((n, n))
index = pd.MultiIndex.from_arrays([iy.ravel(), ix.ravel()])
values = np.random.rand(n*n, 3)
df = pd.DataFrame(values, index=index, columns=["a", "b", "c"])

ix = np.unique(np.random.randint(0, n, 500))
iy = np.unique(np.random.randint(0, n, 500))

m2 = get_memory()
objs1 = get_object_ids()

z = []
for i in range(5):
    df2 = df.reindex(ix, level=0).reindex(iy, level=1)
    z.append(df2.mean().mean())
    df.index._tuples = None  # ****
    df.index._cleanup()      # ****
    del df2
    gc.collect()             # ****

m3 = get_memory()
print (m2-m1)/1e6, (m3-m2)/1e6

from collections import Counter
counter = Counter()
for obj in gc.get_objects():
    if id(obj) not in objs1:
        typename = type(obj).__name__
        counter[typename] += 1
print counter
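Back in the question's setting, the same cache-clearing would presumably go after each update1 call. A sketch of that, reusing the internal _tuples / _cleanup() calls from the debug code above (these are pandas internals and may differ in other versions):

import gc

for centroid in centroids:
    new_centroid, best_mean_dist = update1(centroid)
    # drop the tuple cache and the index engine that df's MultiIndex
    # built while re-indexing, then force a collection (internal pandas API)
    df.index._tuples = None
    df.index._cleanup()
    gc.collect()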
Source: https://stackoverflow.com/questions/21255234/dataframe-re-indexing-object-unnecessarily-preserved-in-memory