Question
I'm trying to pickle objects that inherit from pandas.DataFrame. The attribute I add to the dataframe disappears during the pickling/unpickling process. There are some obvious workarounds, but... am I doing something wrong, or is this a bug?
import pandas as pd
import pickle

class Foo(pd.DataFrame):
    def __init__(self, tag, df):
        super().__init__(df)
        self._tag = tag

foo = Foo('mytag', pd.DataFrame({'a':[1,2,3],'b':[4,5,6]}))
print(foo)
print(foo._tag)
print("-------------------------------------")
with open("foo.pkl", "wb") as pkl:
    pickle.dump(foo, pkl)
with open("foo.pkl", "rb") as pkl:
    foo1 = pickle.load(pkl)
print(type(foo1))
print(foo1)
print(foo1._tag)
Here is my output:
   a  b
0  1  4
1  2  5
2  3  6
mytag
-------------------------------------
<class '__main__.Foo'>
   a  b
0  1  4
1  2  5
2  3  6
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-20-1e7e89e199c8> in <module>
21 print(type(foo1))
22 print(foo1)
---> 23 print(foo1._tag)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
5065 if self._info_axis._can_hold_identifiers_and_holds_name(name):
5066 return self[name]
-> 5067 return object.__getattribute__(self, name)
5068
5069 def __setattr__(self, name, value):
AttributeError: 'Foo' object has no attribute '_tag'
(Python 3.7, pandas 0.24.2, pickle.format_version 4.0)
Answer 1:
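I think this is an issue with how pandas handles attribute access. Even a stripped-down subclass whose __init__ only sets the attribute, without calling super().__init__(), fails with a RecursionError when it is instantiated:
class Foo(pd.DataFrame):
    def __init__(self, tag, df):
        self._tag = tag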
Traceback (most recent call last):
File "c:\Users\Michael\.vscode\extensions\ms-python.python-2019.6.24221\pythonFiles\ptvsd_launcher.py", line 43, in <module>
main(ptvsdArgs)
File "c:\Users\Michael\.vscode\extensions\ms-python.python-2019.6.24221\pythonFiles\lib\python\ptvsd\__main__.py", line 434, in main
run()
File "c:\Users\Michael\.vscode\extensions\ms-python.python-2019.6.24221\pythonFiles\lib\python\ptvsd\__main__.py", line 312, in run_file
runpy.run_path(target, run_name='__main__')
File "C:\Users\Michael\AppData\Local\Programs\Python\Python37-32\lib\runpy.py", line 263, in run_path
pkg_name=pkg_name, script_name=fname)
File "C:\Users\Michael\AppData\Local\Programs\Python\Python37-32\lib\runpy.py", line 96, in _run_module_code
mod_name, mod_spec, pkg_name, script_name)
File "C:\Users\Michael\AppData\Local\Programs\Python\Python37-32\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "c:\Users\Michael\Desktop\sandbox\sandbox.py", line 8, in <module>
foo = Foo('mytag', pd.DataFrame({'a':[1,2,3],'b':[4,5,6]}))
File "c:\Users\Michael\Desktop\sandbox\sandbox.py", line 6, in __init__
self._tag = tag
File "c:\Users\Michael\Desktop\sandbox\venv\lib\site-packages\pandas\core\generic.py", line 5205, in __setattr__
existing = getattr(self, name)
File "c:\Users\Michael\Desktop\sandbox\venv\lib\site-packages\pandas\core\generic.py", line 5178, in __getattr__
if self._info_axis._can_hold_identifiers_and_holds_name(name):
File "c:\Users\Michael\Desktop\sandbox\venv\lib\site-packages\pandas\core\generic.py", line 5178, in __getattr__
if self._info_axis._can_hold_identifiers_and_holds_name(name):
File "c:\Users\Michael\Desktop\sandbox\venv\lib\site-packages\pandas\core\generic.py", line 5178, in __getattr__
if self._info_axis._can_hold_identifiers_and_holds_name(name):
[Previous line repeated 487 more times]
File "c:\Users\Michael\Desktop\sandbox\venv\lib\site-packages\pandas\core\generic.py", line 489, in _info_axis
return getattr(self, self._info_axis_name)
File "c:\Users\Michael\Desktop\sandbox\venv\lib\site-packages\pandas\core\generic.py", line 5163, in __getattr__
def __getattr__(self, name):
File "c:\Users\Michael\.vscode\extensions\ms-python.python-2019.6.24221\pythonFiles\lib\python\ptvsd\_vendored\pydevd\_pydevd_bundle\pydevd_trace_dispatch_regular.py", line 362, in __call__
is_stepping = pydev_step_cmd != -1
RecursionError: maximum recursion depth exceeded in comparison
I think the cause is pandas's custom attribute handling. Its own __getattr__() falls back to object.__getattribute__(), which raises an error for any attribute it does not know about. They're overriding the default __getattr__() behavior, which I'm guessing interferes with inheritance.
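The RecursionError in the traceback above can be reproduced without pandas. The following is only a toy model, not pandas's actual code: `ToyFrame`, `Broken`, and `Working` are hypothetical names, and the `__getattr__`/`__setattr__` pair merely imitates the pattern visible in the traceback, where `__setattr__` calls `getattr`, and `__getattr__` in turn reads another attribute that does not exist until the base `__init__` has run.

```python
class ToyFrame:
    """Hypothetical stand-in for a DataFrame; this is not pandas code."""

    def __init__(self, columns):
        # Bypass our own __setattr__ while the object is being set up.
        object.__setattr__(self, "_columns", list(columns))

    def __getattr__(self, name):
        # Runs only when normal lookup fails. It consults the column
        # labels, which means reading another attribute of self.
        if name in self._columns:
            return f"<column {name}>"
        raise AttributeError(name)

    def __setattr__(self, name, value):
        # Probe for an existing attribute or column before assigning.
        getattr(self, name, None)
        object.__setattr__(self, name, value)


class Broken(ToyFrame):
    def __init__(self, tag, columns):
        # The base __init__ never ran, so _columns is missing and every
        # failed lookup re-enters __getattr__.
        self._tag = tag


class Working(ToyFrame):
    def __init__(self, tag, columns):
        super().__init__(columns)  # initialise the base state first
        self._tag = tag


try:
    Broken("mytag", ["a", "b"])
    outcome = "no error"
except RecursionError:
    outcome = "RecursionError"

print(outcome)
print(Working("mytag", ["a", "b"])._tag)
```

In this sketch the only difference between the two subclasses is whether the base initialiser runs before the first attribute assignment, which is also the difference between the question's `Foo` and the simplified one in this answer.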
Answer 2:
Michael's answer matches my findings in looking at their code. DataFrame inherits from NDFrame, which also overrides __setattr__, so that probably contributes to this issue as well.
The most straightforward solution here would be to create a class that uses a dataframe as an attribute so that your own attributes are settable.
class Foo:
    def __init__(self, tag, df):
        self.df = df
        self._tag = tag
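As a quick check of this composition approach, here is a minimal sketch that round-trips such a wrapper through pickle. It uses an in-memory `pickle.dumps`/`pickle.loads` round trip instead of the file from the question, which is an assumption made only to keep the example self-contained.

```python
import pickle

import pandas as pd


class Foo:
    def __init__(self, tag, df):
        self.df = df      # the wrapped DataFrame
        self._tag = tag   # ordinary instance attribute, stored in __dict__


foo = Foo("mytag", pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}))

# Round-trip through an in-memory pickle instead of a file on disk.
foo1 = pickle.loads(pickle.dumps(foo))

print(foo1._tag)
print(foo1.df.equals(foo.df))
```

The trade-off is that DataFrame methods are no longer available on `Foo` itself and have to be reached through `foo.df`.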
Also: I would consider trying dill if the native pickle fails to pickle complex objects like these. After running pip install dill, all you need to do is import dill as pickle, since it exposes the same function names as pickle.
Answer 3:
How strange, I posted a similar question at almost the same time. And in a follow-up remark, I've discovered something even more basic: meta-data you define yourself in a DataFrame subclass does not even survive SLICING operations.
After you create your instance of foo, print it, and print foo._tag, try this:
bar = foo[1:]
print(bar)
print(bar._tag)
This also raises an AttributeError, same as your pickle-unpickle operation.
There might be good reasons to change or even remove meta-data when you slice. But you might very well want to preserve it. I don't know whether there is a single point in the Pandas code which affects both slicing and pickling, but I suspect there is.
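One candidate for such a single point is the pair of subclassing hooks described in the pandas documentation on extending pandas: the `_metadata` class attribute and the `_constructor` property. The sketch below assumes those hooks behave as documented. `TaggedFrame` and its keyword argument `tag` are hypothetical names chosen for illustration, and the constructor signature differs from the question's `Foo(tag, df)` so that pandas can rebuild the subclass from its own internal data.

```python
import pickle

import pandas as pd


class TaggedFrame(pd.DataFrame):
    # Names listed here are treated by pandas as subclass metadata.
    _metadata = ["_tag"]

    def __init__(self, *args, tag=None, **kwargs):
        super().__init__(*args, **kwargs)  # set up the DataFrame state first
        self._tag = tag

    @property
    def _constructor(self):
        # Make operations such as slicing return a TaggedFrame.
        return TaggedFrame


foo = TaggedFrame({"a": [1, 2, 3], "b": [4, 5, 6]}, tag="mytag")

sliced = foo[1:]                             # the slicing case from this answer
restored = pickle.loads(pickle.dumps(foo))   # the pickling case from the question

print(type(sliced).__name__, sliced._tag)
print(type(restored).__name__, restored._tag)
```

Not every pandas operation is guaranteed to carry `_metadata` attributes over to its result, so it is worth checking the specific operations you rely on.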
Source: https://stackoverflow.com/questions/57237664/cant-unpickle-class-that-inherits-from-pandas-dataframe