Trouble with NaNs: set_index().reset_index() corrupts data

Submitted by 与世无争的帅哥 on 2021-01-29 03:13:19

Question


I have read that NaNs are problematic, but the following silently corrupts my data instead of raising an error. Is this a bug, or have I missed something basic in the documentation? I would like the second command either to raise an error or to return the same data as the first command:

ipdb> df
    year  PRuid  QC       data
18  2007  nonQC   0  8.014261
19  2008  nonQC   0  7.859152
20  2010  nonQC   0  7.468260
21  1985     10 NaN  0.861403
22  1985     11 NaN  0.878531
23  1985     12 NaN  0.842704
24  1985     13 NaN  0.785877
25  1985     24   1  0.730625
26  1985     35 NaN  0.816686
27  1985     46 NaN  0.819271
28  1985     47 NaN  0.807050
ipdb> df.set_index(['year','PRuid','QC']).reset_index()
    year  PRuid  QC       data
0   2007  nonQC   0  8.014261
1   2008  nonQC   0  7.859152
2   2010  nonQC   0  7.468260
3   1985     10   1  0.861403
4   1985     11   1  0.878531
5   1985     12   1  0.842704
6   1985     13   1  0.785877
7   1985     24   1  0.730625
8   1985     35   1  0.816686
9   1985     46   1  0.819271
10  1985     47   1  0.807050

The value of "QC" is changed from NaN to 1 in every row where it should still be NaN.

Btw, for symmetry I added the ".reset_index()", but the data corruption is introduced by set_index.
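
A round trip like the one above can be checked mechanically. The following is only a sketch: the helper name check_nan_round_trip and the three-row frame are made up for illustration, and the frame is a cut-down stand-in for the data shown above. It raises an error if set_index followed by reset_index changes which cells of the key columns are NaN, which is the behaviour asked for here.

```python
import numpy as np
import pandas as pd


def check_nan_round_trip(frame, keys):
    """Hypothetical helper: raise if the index round trip moves or drops NaNs."""
    round_tripped = frame.set_index(keys).reset_index()
    # Boolean masks of the NaN positions in the key columns, before and after.
    before = frame[keys].isna().to_numpy()
    after = round_tripped[keys].isna().to_numpy()
    if not (before == after).all():
        raise ValueError("NaN positions changed in index columns: %s" % keys)
    return round_tripped


# Made-up sample with the same column layout as the frame in the question.
sample = pd.DataFrame({
    "year": [2007, 1985, 1985],
    "PRuid": ["nonQC", "10", "24"],
    "QC": [0.0, np.nan, 1.0],
    "data": [8.014261, 0.861403, 0.730625],
})

checked = check_nan_round_trip(sample, ["year", "PRuid", "QC"])
```

On a pandas version that has the bug, the call would be expected to raise instead of returning a silently altered frame.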

And in case this is interesting, the version is:

pd.version
<module 'pandas.version' from '/usr/lib/python2.6/site-packages/pandas-0.10.1-py2.6-linux-x86_64.egg/pandas/version.pyc'>

Answer 1:


So this was a bug. By the end of May 2013, pandas 0.11.1 should be released with the bug fix (see comments on this question). In the meantime, I avoided putting any column that contains NaNs into a MultiIndex, for instance by replacing the NaNs in the column 'QC' with some other flag value (-99).
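
The flag-value workaround can be sketched roughly as follows. This is an illustration, not the exact code I used: the small frame is made up to resemble the one in the question, SENTINEL is an assumed name, and -99 only works if it never occurs as a genuine 'QC' value.

```python
import numpy as np
import pandas as pd

SENTINEL = -99  # assumed flag value; must not collide with real QC values

# Made-up sample with the same column layout as the frame in the question.
df = pd.DataFrame({
    "year": [2007, 1985, 1985],
    "PRuid": ["nonQC", "10", "24"],
    "QC": [0.0, np.nan, 1.0],
    "data": [8.014261, 0.861403, 0.730625],
})

# Replace NaN with the flag before the column becomes an index level,
# so the MultiIndex never contains a NaN.
indexed = df.fillna({"QC": SENTINEL}).set_index(["year", "PRuid", "QC"])

# Going back: reset the index, then turn the flag into NaN again.
restored = indexed.reset_index()
restored["QC"] = restored["QC"].replace(SENTINEL, np.nan)
```

While the data is indexed, the former NaN rows are addressed with -99 in the 'QC' level, so any lookup on that level has to use the flag rather than NaN.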



Source: https://stackoverflow.com/questions/16511319/trouble-with-nans-set-index-reset-index-corrupts-data