
Question:

There are good solutions for imputing a pandas DataFrame. But since I work mainly with NumPy arrays, I have to create a new pandas DataFrame object, impute, and then convert back to a NumPy array, as follows:

    nomDF = pd.DataFrame(x_nominal)  # convert np.array to pd.DataFrame
    nomDF = nomDF.apply(lambda x: x.fillna(x.value_counts().index[0]))  # replace NaN with the most frequent value in each column
    x_nominal = nomDF.values  # convert pd.DataFrame back to np.array

Is there a way to impute directly in a NumPy array?

Answer 1:

We could use SciPy's mode to get the most frequent value in each column. The remaining work is to get the NaN indices and replace those positions in the input array with the mode values by indexing.

So, the implementation would look something like this -

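    import numpy as np
    from scipy.stats import mode

    R, C = np.where(np.isnan(x_nominal))  # row and column indices of every NaN
    # Per-column mode; nan_policy='omit' keeps NaN itself from being reported as the mode
    vals = np.ravel(mode(x_nominal, axis=0, nan_policy='omit')[0])
    x_nominal[R, C] = vals[C]  # fill each NaN with the mode of its own column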
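For readers who would rather not depend on SciPy, the same idea can be sketched with NumPy alone. This is only an illustration under a few assumptions: the array is 2-D with a float dtype, the helper name `impute_mode` is hypothetical, and the sample data is made up. It counts the non-NaN values of each column with `np.unique` and writes the most frequent one into the NaN positions.

```python
import numpy as np

def impute_mode(arr):
    """Return a copy of a 2-D float array with NaNs replaced by each column's mode."""
    out = arr.copy()
    for j in range(out.shape[1]):
        col = out[:, j]            # a view, so writes below land in `out`
        mask = np.isnan(col)
        if mask.all() or not mask.any():
            continue               # nothing to impute from, or nothing to impute
        values, counts = np.unique(col[~mask], return_counts=True)
        # np.unique returns sorted values, so argmax picks the smallest value on ties
        col[mask] = values[np.argmax(counts)]
    return out

# Made-up sample data for illustration
x_nominal = np.array([[1.0, 5.0],
                      [np.nan, 5.0],
                      [1.0, np.nan],
                      [2.0, 7.0]])
imputed = impute_mode(x_nominal)
```

Unlike the in-place assignment above, this sketch returns a new array and leaves the input untouched; drop the `.copy()` if in-place modification is preferred.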

Please note that with pandas' value_counts, we would be choosing the highest value when several categories/elements share the same highest count, i.e. in tie situations. With SciPy's mode, it would be the lowest one in such tie cases.
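To make the tie-breaking difference concrete, here is a small NumPy-only sketch. It does not reproduce the internals of either pandas or SciPy; it just shows how one could pick the lowest or the highest value among equally frequent candidates. The sample column and the variable names are made up for illustration.

```python
import numpy as np

# 1.0 and 3.0 both occur twice, so the mode is ambiguous
col = np.array([3.0, 1.0, 3.0, 1.0, 2.0])

values, counts = np.unique(col, return_counts=True)  # values come back sorted ascending

# argmax returns the first maximum, i.e. the smallest tied value
lowest_tied = values[np.argmax(counts)]

# Reversing both arrays makes argmax land on the largest tied value instead
highest_tied = values[::-1][np.argmax(counts[::-1])]
```

If the imputed result has to match an existing pandas pipeline exactly, it is worth checking tie cases like this one explicitly rather than relying on either default.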

If you are dealing with a mixed dtype of strings and NaNs, I would suggest a few modifications, keeping the last step unchanged, to make it work -

    x_nominal_U3 = x_nominal.astype('U3')
    R, C = np.where(x_nominal_U3 == 'nan')
    vals = mode(x_nominal_U3, axis=0)[0].ravel()

This throws a warning for the mode calculation: "RuntimeWarning: The input array could not be properly checked for nan values. nan values will be ignored." But since we actually want to ignore NaNs in that mode calculation, we should be okay there.
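As an alternative for the string case that sidesteps the warning, the per-column counting can be done with `np.unique` after masking out the 'nan' markers. This is a sketch under stated assumptions, not a drop-in replacement: the sample data is made up, the input is assumed to be an object array of short strings with `np.nan` for missing entries, and the `'U3'` cast truncates any string longer than three characters, so a wider dtype would be needed for longer labels.

```python
import numpy as np

# Made-up object array mixing strings and NaN
x_nominal = np.array([["a", "x"],
                      [np.nan, "y"],
                      ["a", np.nan],
                      ["b", "y"]], dtype=object)

x_str = x_nominal.astype("U3")   # NaN becomes the literal string 'nan'
missing = x_str == "nan"         # boolean mask of missing entries

for j in range(x_str.shape[1]):
    known = x_str[~missing[:, j], j]   # observed labels in this column
    if known.size == 0:
        continue                       # column is entirely missing
    values, counts = np.unique(known, return_counts=True)
    # Ties resolve to the alphabetically first label, since values are sorted
    x_str[missing[:, j], j] = values[np.argmax(counts)]
```

A genuine category spelled 'nan' would be treated as missing here, which is the same limitation as in the string comparison above.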

Source: How to impute each categorical column in numpy array
