How to impute each categorical column in a NumPy array

Anonymous (unverified), submitted on 2019-12-03 01:45:01

Question:

There are good solutions for imputing a pandas DataFrame. But since I work mainly with NumPy arrays, I have to create a new pandas DataFrame object, impute, and then convert back to a NumPy array as follows:

nomDF = pd.DataFrame(x_nominal)  # convert np.array to pd.DataFrame
nomDF = nomDF.apply(lambda x: x.fillna(x.value_counts().index[0]))  # replace NaN with the most frequent value in each column
x_nominal = nomDF.values  # convert pd.DataFrame back to np.array

Is there a way to impute directly in the NumPy array?

Answer 1:

We could use SciPy's mode to get the most frequent value in each column. The remaining work is to get the indices of the NaNs and replace those entries in the input array with the mode of their column by indexing.

So, the implementation would look something like this -

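import numpy as np
from scipy.stats import mode

# Row and column indices of the missing entries
R, C = np.where(np.isnan(x_nominal))
# Per-column mode; nan_policy='omit' skips NaNs (recent SciPy versions would otherwise propagate them)
vals = mode(x_nominal, axis=0, nan_policy='omit')[0].ravel()
# Fill each missing entry with the mode of its column
x_nominal[R, C] = vals[C]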
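To try the indexing step end to end, here is a minimal sketch on a made-up array. It swaps scipy.stats.mode for np.unique so that it does not depend on how a given SciPy version treats NaNs. The helper column_mode and the sample data are assumptions for illustration, not part of the original answer.

```python
import numpy as np

# Made-up sample data: two numeric-coded categorical columns with gaps
x_nominal = np.array([[1.0, 5.0],
                      [np.nan, 5.0],
                      [1.0, np.nan],
                      [2.0, 7.0]])

def column_mode(col):
    """Most frequent non-NaN value of a 1D float array (hypothetical helper).

    Assumes the column has at least one non-NaN value.
    """
    values, counts = np.unique(col[~np.isnan(col)], return_counts=True)
    # argmax picks the first maximum, so ties go to the smallest value
    return values[np.argmax(counts)]

# One mode per column
vals = np.array([column_mode(x_nominal[:, c]) for c in range(x_nominal.shape[1])])

# Same indexing step as in the answer: C selects the right mode for each NaN
R, C = np.where(np.isnan(x_nominal))
x_nominal[R, C] = vals[C]
```

After this runs, x_nominal no longer contains NaNs, and each gap holds the mode of its own column.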

Please note the difference in tie situations, i.e. when several categories/elements share the same highest count. For pandas, with value_counts, we would be choosing the highest of the tied values. With SciPy's mode, it would be the lowest one.
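If the tie-breaking rule matters for your data, one option is to make it explicit rather than rely on either library's behavior. The following is a small sketch using np.unique; the variable names and the sample column are made up for illustration.

```python
import numpy as np

# Made-up column where 1.0 and 3.0 are tied for the highest count
col = np.array([3.0, 1.0, 3.0, 1.0, 2.0])

# np.unique returns the distinct values in ascending order with their counts
values, counts = np.unique(col, return_counts=True)

# argmax returns the first maximum, so this picks the smallest tied value
lowest_tied = values[np.argmax(counts)]

# Reversing both arrays makes the first maximum the largest tied value
highest_tied = values[::-1][np.argmax(counts[::-1])]
```

Here lowest_tied is 1.0 and highest_tied is 3.0, so either convention can be reproduced on purpose.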

If you are dealing with a mixed dtype of strings and NaNs, I would suggest a few modifications to make it work, keeping the last step unchanged -

x_nominal_U3 = x_nominal.astype('U3')
R,C = np.where(x_nominal_U3=='nan')
vals = mode(x_nominal_U3,axis=0)[0].ravel()

This throws a warning for the mode calculation: "RuntimeWarning: The input array could not be properly checked for nan values. nan values will be ignored." But since we actually want to ignore NaNs for that mode calculation, we should be okay there.
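For string categories, a variant that avoids scipy.stats.mode altogether can be sketched with np.unique. This is an alternative under stated assumptions, not the method from the answer: the sample data is made up, astype(str) is used instead of 'U3' so that longer category names are not truncated to three characters, and any genuine category literally named "nan" would be treated as missing.

```python
import numpy as np

# Made-up object array mixing string categories and float NaNs
x_nominal = np.array([["red", "small"],
                      [np.nan, "small"],
                      ["red", np.nan],
                      ["blue", "large"]], dtype=object)

# Converting to str turns each float NaN into the string 'nan'
x_str = x_nominal.astype(str)
missing = (x_str == "nan")

for c in range(x_str.shape[1]):
    # Count only the entries that are actually present in this column
    present = x_str[~missing[:, c], c]
    values, counts = np.unique(present, return_counts=True)
    # Ties go to the alphabetically first category (first maximum of counts)
    x_nominal[missing[:, c], c] = values[np.argmax(counts)]
```

Each column is assumed to contain at least one non-missing entry; otherwise np.argmax would be called on an empty array and raise an error.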


