Let's suppose I have a column with categorical data ("red", "green", "blue") and empty cells:
red
green
red
blue
NaN
I'm sure that the NaN b
The simplest strategy for handling missing data is to remove records that contain a missing value.
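As a rough sketch of that removal strategy, assuming the data lives in a pandas DataFrame (the column name "color" is made up for illustration), rows with a missing category can be dropped like this:

```python
import numpy as np
import pandas as pd

# Toy column mirroring the example in the question; "color" is a hypothetical name.
df = pd.DataFrame({"color": ["red", "green", "red", "blue", np.nan]})

# Keep only the rows where "color" is present.
cleaned = df.dropna(subset=["color"])

print(cleaned["color"].tolist())  # ['red', 'green', 'red', 'blue']
```

Dropping rows is only reasonable when few records are affected, since every removed row also discards the values in its other columns.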
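The scikit-learn library provides the Imputer() pre-processing class, which can be used to replace missing values. Since the data is categorical, the mean is not a meaningful replacement value; use the most frequent value instead:
from sklearn.preprocessing import Imputer
imp = Imputer(missing_values='NaN', strategy='most_frequent', axis=0)
The Imputer class operates directly on the NumPy array instead of the DataFrame. Note that Imputer was deprecated in scikit-learn 0.20 and removed in 0.22; in current versions, use sklearn.impute.SimpleImputer(strategy='most_frequent'), which also accepts string-valued columns.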
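On recent scikit-learn releases, the same most-frequent strategy is available through SimpleImputer in sklearn.impute. The following is a minimal sketch, assuming a pandas DataFrame with a made-up column name "color"; it is one possible way to wire this up, not the only one:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy column from the question; "color" is a hypothetical name.
# dtype=object keeps the strings and NaN in a plain object column.
df = pd.DataFrame({"color": ["red", "green", "red", "blue", np.nan]}, dtype=object)

# Replace NaN (the default missing_values marker) with the most frequent category.
imp = SimpleImputer(strategy="most_frequent")
filled = imp.fit_transform(df[["color"]])  # 2-D NumPy array, shape (5, 1)

# Write the imputed values back into the DataFrame.
df["color"] = filled.ravel()

print(df["color"].tolist())  # ['red', 'green', 'red', 'blue', 'red']
```

The imputer returns a NumPy array, so the result has to be assigned back to the column if you want to keep working with the DataFrame.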
Last but not least, not all ML algorithms fail on missing values; some can handle them natively. Support also differs between implementations of the same algorithm.