How do you deal with missing data using numpy/scipy?

隐身守侯 提交于 2019-12-05 02:14:31

If you are willing to consider a library, pandas (http://pandas.pydata.org/) is a library built on top of numpy which amongst many other things provides:

Intelligent data alignment and integrated handling of missing data: gain automatic label-based alignment in computations and easily manipulate messy data into an orderly form

I've been using it for almost one year in the financial industry where missing and badly aligned data is the norm and it really made my life easier.

I also question the problem with masked arrays. Here are a couple of examples:

import numpy as np
data = np.ma.masked_array(np.arange(10))
data[5] = np.ma.masked # Mask a specific value

data[data>6] = np.ma.masked # Mask any value greater than 6

# Same thing done at initialization time
init_data = np.arange(10)
data = np.ma.masked_array(init_data, mask=(init_data > 6))

Masked arrays are the anwswer, as DpplerShift describes. For quick and dirty use, you can use fancy indexing with boolean arrays:

>>> import numpy as np
>>> data = np.arange(10)
>>> valid_idx = data % 2 == 0 #pretend that even elements are missing

>>> # Get non-missing data
>>> data[valid_idx]
array([0, 2, 4, 6, 8])

You can now use valid_idx as a quick mask on other data as well

>>> comparison = np.arange(10) + 10
>>> comparison[valid_idx]
array([10, 12, 14, 16, 18])

See sklearn.preprocessing.Imputer

import numpy as np
from sklearn.preprocessing import Imputer
imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
imp.fit([[1, 2], [np.nan, 3], [7, 6]])
X = [[np.nan, 2], [6, np.nan], [7, 6]]
print(imp.transform(X))  

Example from http://scikit-learn.org/

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!