Question
I have a data frame that I initialize out of scope of a local method. I would like to do as follows:
    def outer_method():
        # ... do outer scope stuff here
        df = pd.DataFrame(columns=['A','B','C','D'])
        def recursive_method(arg):
            # ... do local stuff here
            # func returns a data frame to be appended to empty data frame
            results_df = func(args)
            df.append(results_df, ignore_index=True)
            return results
        recursive_method(arg)
        return df
However, this does NOT work. The df is always empty if I append to it this way.
I found the answer to my problem here: appending-to-an-empty-data-frame-in-pandas... this works, IF the empty DataFrame object is in scope of the method, but not for my case. As per @DSM's comment "but the append doesn't happen in-place, so you'll have to store the output if you want it:"
IOW, I would need to have something like:
    df = df.append(results_df, ignore_index=True)
in my local method, but this doesn't help me get access to my outer scope variable df to append to it.
Is there a way to make this happen in place? This works fine with the Python extend method for extending the contents of a list object (I realize DataFrames are not lists, but...). Is there an analogous way to do this with a DataFrame object without having to deal with my scoping issues for df?
Btw, the Pandas concat method also works, but I run into the same issue of variable scope.
Answer 1:
In Python3, you could use the nonlocal keyword:
    def outer_method():
        # ... do outer scope stuff here
        df = pd.DataFrame(columns=['A','B','C','D'])
        def recursive_method(arg):
            # rebind the df defined in outer_method instead of creating a local
            nonlocal df
            # ... do local stuff here
            # func returns a data frame to be appended to empty data frame
            results_df = func(args)
            df = df.append(results_df, ignore_index=True)
            return results
        recursive_method(arg)
        return df
But note that calling df.append returns a new DataFrame each time and thus requires copying all the old data into the new DataFrame. If you do this inside a loop N times, you end up making on the order of 1+2+3+...+N = O(N^2) copies, which is very bad for performance.
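To make the quadratic cost concrete, here is a back-of-the-envelope sketch in plain Python. It assumes that every append copies all rows accumulated so far plus the new chunk; the function names and the one-row chunk size are illustrative and are not part of pandas.

```python
def rows_copied_by_repeated_append(n_chunks, rows_per_chunk=1):
    """Rows copied when every append rebuilds the whole frame."""
    total = 0
    accumulated = 0
    for _ in range(n_chunks):
        accumulated += rows_per_chunk
        total += accumulated  # the new frame holds the old rows plus the new chunk
    return total


def rows_copied_by_single_concat(n_chunks, rows_per_chunk=1):
    """Rows copied when all chunks are joined once at the end."""
    return n_chunks * rows_per_chunk


for n in (10, 100, 1000):
    print(n, rows_copied_by_repeated_append(n), rows_copied_by_single_concat(n))
```

For 1000 one-row chunks this counts 500500 copied rows for repeated appends against 1000 for a single concatenation.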
If you do not need df inside recursive_method for any purpose other than appending, it is better to append to a list, and then construct the DataFrame (by calling pd.concat once) after recursive_method is done:
    df = pd.DataFrame(columns=['A','B','C','D'])
    data = [df]
    def recursive_method(arg, data):
        # ... do stuff here
        # func returns a data frame to be appended to empty data frame
        results_df = func(args)
        # list.append mutates data in place, so no nonlocal is needed
        data.append(results_df)
        return results
    recursive_method(arg, data)
    # build the final DataFrame once, after all the pieces are collected
    df = pd.concat(data, ignore_index=True)
This is the best solution if all you need to do is collect data inside recursive_method and can wait to construct the new df after recursive_method is done.
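A self-contained sketch of the list-collecting approach might look as follows. The func below is a hypothetical stand-in for the question's func, and the depth-based recursion is only an assumption about what the real recursive_method does.

```python
import pandas as pd

COLUMNS = ['A', 'B', 'C', 'D']


def func(depth):
    # Hypothetical stand-in for the question's func(): returns a one-row frame.
    return pd.DataFrame([[depth, depth * 2, depth * 3, depth * 4]], columns=COLUMNS)


def recursive_method(depth, data):
    # list.append mutates the shared list in place, so neither nonlocal
    # nor a return value is needed to get the pieces back out.
    if depth == 0:
        return
    data.append(func(depth))
    recursive_method(depth - 1, data)


data = []
recursive_method(3, data)
# Build the DataFrame once, after the recursion has finished.
df = pd.concat(data, ignore_index=True)
print(df)
```

Note that pd.concat raises a ValueError when given an empty list, so if the recursion may collect nothing, that case needs its own handling (seeding the list with an empty DataFrame is one way to do it).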
In Python2, if you must use df inside recursive_method, then you could pass df as an argument to recursive_method, and return df too:
    df = pd.DataFrame(columns=['A','B','C','D'])
    def recursive_method(arg, df):
        # ... do stuff here
        results, df = recursive_method(arg, df)
        # func returns a data frame to be appended to empty data frame
        results_df = func(args)
        df = df.append(results_df, ignore_index=True)
        return results, df
    results, df = recursive_method(arg, df)
but be aware that you will be paying a heavy price doing the O(N^2) copying mentioned above.
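DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions the same pass-and-return pattern would be written with pd.concat. The following is a minimal sketch under that assumption; func and the depth-based recursion are hypothetical stand-ins, and the O(N^2) copying cost is unchanged.

```python
import pandas as pd

COLUMNS = ['A', 'B', 'C', 'D']


def func(depth):
    # Hypothetical stand-in for the question's func(): returns a one-row frame.
    return pd.DataFrame([[depth] * 4], columns=COLUMNS)


def recursive_method(depth, df):
    if depth == 0:
        return df
    results_df = func(depth)
    # pd.concat, like the old df.append, returns a new DataFrame, so the
    # result has to be rebound and handed back to the caller.
    # Skipping the empty seed frame is a choice made for this sketch.
    if df.empty:
        df = results_df
    else:
        df = pd.concat([df, results_df], ignore_index=True)
    return recursive_method(depth - 1, df)


df = pd.DataFrame(columns=COLUMNS)
df = recursive_method(3, df)
print(df)
```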
Why DataFrames should not be appended to in-place:
The underlying data in a DataFrame is stored in NumPy arrays. The data in a NumPy array comes from a contiguous block of memory. Sometimes there is not enough space to resize the NumPy arrays to a larger contiguous block of memory even if memory is available (imagine the array being sandwiched in between other data structures). In that case, in order to resize the array, a new larger block of memory has to be allocated somewhere else and all the data from the original array has to be copied to the new block. In general, it can't be done in-place.
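The same behavior can be observed directly in NumPy. This small sketch uses np.append and np.shares_memory to show that growing an array produces a separate array rather than extending the original; the variable names are illustrative.

```python
import numpy as np

a = np.array([0, 1, 2])
b = np.append(a, [3, 4, 5])  # allocates a new array and copies a's data into it

print(a)                       # a itself is unchanged
print(b)                       # b holds the old values followed by the new ones
print(np.shares_memory(a, b))  # False: the two arrays use different buffers
```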
DataFrames do have a private method, _update_inplace, which could be used to redirect a DataFrame's underlying data to new data. This is only a pseudo-inplace operation, since the new data (think NumPy arrays) has to be allocated (with all the attendant copying) first. So using _update_inplace has two strikes against it: it uses a private method which (in theory) may not be around in future versions of Pandas, and it incurs the O(N^2) copying penalty.
    In [231]: df = pd.DataFrame([[0,1,2]])

    In [232]: df
    Out[232]:
       0  1  2
    0  0  1  2

    In [233]: df._update_inplace(df.append([[3,4,5]]))

    In [234]: df
    Out[234]:
       0  1  2
    0  0  1  2
    0  3  4  5
Source: https://stackoverflow.com/questions/35493517/issue-with-appending-to-dataframe-if-empty