Keeping NaNs with pandas dataframe inequalities

I have a pandas.DataFrame object that contains about 100 columns and 200000 rows of data. I am trying to convert it to a bool dataframe where True means that the value is gr

3 Answers

悲哀的现实

2020-12-17 22:14
You can do:
```
import numpy as np

new_df = (df >= threshold).astype(float)  # 1.0 = True, 0.0 = False
new_df[df.isnull()] = np.nan              # put the missing values back
```
But that differs from what you get with the apply method. Here the mask has float dtype, containing NaN, 0.0 and 1.0. The apply solution gives you object dtype with NaN, False and True.

Neither is safe to use as a mask, because you might not get what you want. IEEE 754 specifies that ordered comparisons involving NaN (<, <=, >, >=) must yield False, and the apply method implicitly violates that by returning NaN.
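A short sketch of both points, using a small made-up array (the values and variable names are illustrative only): ordered comparisons against NaN come out False, and a float mask that carries NaN turns those entries into True once it is cast to bool.

```python
import numpy as np

values = np.array([1.0, np.nan, -2.0])  # illustrative data, not from the question

# Ordered comparisons involving NaN evaluate to False.
ge_zero = values >= 0
lt_zero = values < 0

# A float mask that keeps NaN: 1.0 = True, 0.0 = False, NaN = missing.
float_mask = np.where(np.isnan(values), np.nan, ge_zero.astype(float))

# Casting that mask to bool maps NaN to True, because NaN is non-zero.
as_bool = float_mask.astype(bool)
```

So the NaN entry is False under both `>=` and `<`, yet it becomes True when the float mask is used as a boolean one.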

The best option is to keep track of the NaNs separately; df.isnull() is quite fast when bottleneck is installed.
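One way this could look, as a minimal sketch: the frame, column names and threshold are made up for illustration, and the per-column fraction at the end is just one example of using the two masks together.

```python
import numpy as np
import pandas as pd

# Illustrative frame and threshold; the real data has ~100 columns and 200000 rows.
df = pd.DataFrame({"a": [0.5, np.nan, 2.0], "b": [np.nan, 1.5, 0.1]})
threshold = 1.0

above = df >= threshold   # plain bool frame; NaN entries compare as False
missing = df.isnull()     # separate bool frame marking the NaNs

# Entries that are known (not missing) and at or above the threshold.
known_above = above & ~missing

# Fraction of non-missing entries per column that meet the threshold.
n_valid = (~missing).sum()
frac_above = known_above.sum() / n_valid
```

Both frames keep a real bool dtype, so either can be used directly for indexing, and the missing-value information is never lost.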

傲寒

2020-12-17 22:14
You can check for NaNs separately, as described in the post "Python - find integer index of rows with NaN in pandas":
```
df.isnull()
```
Combine the output of isnull with df >= threshold using bitwise or:
```
# Parentheses are required: | binds more tightly than >=
df.isnull() | (df >= threshold)
```
You can expect computing and combining the two masks to take closer to 200 ms, which should be far enough from 20 s to be acceptable.
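A small sketch of the combined mask on made-up data (the column names and threshold are illustrative). Note that in this combination the missing entries come out as True rather than NaN, so keep the isnull() result around if you need to tell the two cases apart.

```python
import numpy as np
import pandas as pd

# Illustrative data; column names and threshold are made up.
df = pd.DataFrame({"a": [0.5, np.nan, 2.0], "b": [np.nan, 1.5, 0.1]})
threshold = 1.0

# True where the value is missing OR at/above the threshold.
# The parentheses matter because `|` binds more tightly than `>=`.
combined = df.isnull() | (df >= threshold)
```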

无人共我

2020-12-17 22:36
In this situation I use an indicator array of floats, coded as 0 = False, 1 = True, and NaN = missing. A pandas DataFrame with NumPy bool dtype cannot hold missing values, and a DataFrame with object dtype holding a mix of Python bool and float objects is not efficient. That leaves DataFrames with np.float64 dtype.

For your comparison, numpy.sign(x - threshold) gives -1 where x < threshold, 0 where x == threshold, and +1 where x > threshold, which might be good enough for your purposes. If you really need 0/1 coding, the conversion can be made in place; note that with the (y + 1) / 2 conversion timed below, values exactly equal to the threshold become 0.5. Timings below are on a 200K-length array x:
```
In [45]: %timeit y = (x > 0); y[pd.isnull(x)] = np.nan
100 loops, best of 3: 8.71 ms per loop

In [46]: %timeit y = np.sign(x)
100 loops, best of 3: 1.82 ms per loop

In [47]: %timeit y = np.sign(x); y += 1; y /= 2
100 loops, best of 3: 3.78 ms per loop
```
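The float coding described above can be sketched on a tiny array. The data and threshold here are illustrative, and the `np.where` variant at the end is an alternative of my own for the `>=` case, not something that was timed above.

```python
import numpy as np

# Illustrative array and threshold (not the 200K-element array that was timed).
x = np.array([0.2, 1.0, 3.5, np.nan])
threshold = 1.0

# -1.0 below, 0.0 equal, +1.0 above; NaN propagates through the subtraction and sign.
y = np.sign(x - threshold)

# In-place conversion towards 0/1 coding.
y += 1
y /= 2
# Values exactly equal to the threshold end up as 0.5 here.

# Alternative: code x >= threshold directly as 0.0 / 1.0 while keeping NaN.
indicator = np.where(np.isnan(x), np.nan, (x >= threshold).astype(float))
```

If ties at the threshold can occur in your data, the second form avoids the 0.5 values at the cost of an extra pass for the NaN check.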