There are good solutions for imputing a pandas DataFrame. But since I am working mainly with numpy arrays, I have to create a new pandas DataFrame object, impute, and then convert back to a numpy array as follows:
    nomDF = pd.DataFrame(x_nominal)  # convert np.array to pd.DataFrame
    nomDF = nomDF.apply(lambda x: x.fillna(x.value_counts().index[0]))  # replace NaN with most frequent in each column
    x_nominal = nomDF.values  # convert pd.DataFrame back to np.array
Is there a way to impute directly in a numpy array?
We could use SciPy's mode to get the most frequent value in each column. The leftover work would be to get the NaN indices and replace those entries in the input array with the mode values by indexing.
So, the implementation would look something like this -
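    import numpy as np
    from scipy.stats import mode

    # Row and column indices of the NaN entries
    R, C = np.where(np.isnan(x_nominal))
    # Per-column mode; nan_policy='omit' skips NaNs explicitly when counting
    vals = np.asarray(mode(x_nominal, axis=0, nan_policy='omit')[0]).ravel()
    # Fill each NaN with the mode of its own column
    x_nominal[R, C] = vals[C]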
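To see the effect on concrete data, here is a minimal sketch with a made-up 4x2 array. The sample values and nan_policy='omit' are my additions, the latter so that NaNs are skipped explicitly when the modes are computed.

```python
import numpy as np
from scipy.stats import mode

# Hypothetical sample data: two columns, NaN marks a missing entry.
x_nominal = np.array([[1.0, 5.0],
                      [np.nan, 5.0],
                      [1.0, np.nan],
                      [2.0, 7.0]])

# Row and column indices of every missing entry.
R, C = np.where(np.isnan(x_nominal))

# Per-column mode with NaNs skipped. np.asarray(...).ravel() flattens the
# result whether or not this SciPy version keeps the reduced axis.
vals = np.asarray(mode(x_nominal, axis=0, nan_policy='omit')[0]).ravel()

# Each missing entry takes the mode of its own column.
x_nominal[R, C] = vals[C]

# x_nominal is now [[1, 5], [1, 5], [1, 5], [2, 7]]
print(x_nominal)
```

The array is modified in place, so make a copy first if the original values are still needed.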
Please note that pandas, with value_counts, would choose the highest value when several categories/elements share the same highest count, i.e. in tie situations. SciPy's mode would pick the lowest one in such tie cases.
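The SciPy side of that tie-breaking rule is easy to check on a toy column. A minimal sketch with made-up values, where 1 and 3 each appear twice (it does not exercise the pandas side):

```python
import numpy as np
from scipy.stats import mode

# Hypothetical column: 1.0 and 3.0 are tied with two occurrences each.
col = np.array([3.0, 1.0, 3.0, 1.0, 2.0])

# np.asarray(...).ravel() copes with SciPy versions that return either a
# scalar or a length-1 array for 1-D input.
tie_winner = np.asarray(mode(col)[0]).ravel()[0]

print(tie_winner)  # 1.0, the smallest of the tied values
```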
If you are dealing with a mixed dtype of strings and NaNs, I would suggest a few modifications to make it work, keeping the last step unchanged:
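    # Convert to fixed-width strings; NaN becomes the string 'nan'.
    # Note: 'U3' keeps at most 3 characters per entry, so widen it for longer labels.
    x_nominal_U3 = x_nominal.astype('U3')
    R, C = np.where(x_nominal_U3 == 'nan')
    vals = mode(x_nominal_U3, axis=0)[0].ravel()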
This throws a warning for the mode calculation: RuntimeWarning: The input array could not be properly checked for nan values. nan values will be ignored. But since we actually want to ignore NaNs for that mode calculation, we should be okay there.
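As far as I know, recent SciPy releases no longer accept string input to mode at all, so on those versions the string variant raises an error instead of a warning. A NumPy-only alternative, sketched under that assumption, is to count labels per column with np.unique. The sample data and the helper name column_mode are hypothetical, and every column is assumed to hold at least one non-missing value.

```python
import numpy as np

# Hypothetical sample data: object array mixing short strings and NaN.
x_nominal = np.array([['a', 'x'],
                      [np.nan, 'y'],
                      ['a', np.nan],
                      ['b', 'y']], dtype=object)

# String view of the data; NaN becomes the literal string 'nan'.
# 'U3' keeps at most 3 characters per entry, so widen it for longer labels.
x_str = x_nominal.astype('U3')
missing = (x_str == 'nan')
R, C = np.where(missing)


def column_mode(col, col_missing):
    """Most frequent non-missing label in one column (hypothetical helper)."""
    labels, counts = np.unique(col[~col_missing], return_counts=True)
    # np.unique returns sorted labels and argmax returns the first maximum,
    # so ties go to the smallest label, matching SciPy's rule.
    return labels[np.argmax(counts)]


vals = np.array([column_mode(x_str[:, j], missing[:, j])
                 for j in range(x_str.shape[1])])

# Same last step as before: fill each missing entry with its column's mode.
x_nominal[R, C] = vals[C]

# x_nominal is now [['a', 'x'], ['a', 'y'], ['a', 'y'], ['b', 'y']]
print(x_nominal)
```

One caveat of matching on the string 'nan': a genuine label spelled nan would be treated as missing too.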