Question
I am trying to pass my dataset to the MATLAB function [ranked,weights] = relieff(X,Ylogical,10,'categoricalx','on') to rank the importance of my predictor features. The dataset (an n-by-m double matrix) has n observations and m discrete (i.e. categorical) features. It happens that each observation (row) in my dataset has at least one NaN value. These NaNs represent unobserved, i.e. missing or null, predictor values in the dataset. (There is no corruption in the dataset, it is just incomplete.)
relieff() uses this function below to remove any rows that contain a NaN:
function [X,Y] = removeNaNs(X,Y)
% Remove observations with missing data
NaNidx = bsxfun(@or,isnan(Y),any(isnan(X),2));
X(NaNidx,:) = [];
Y(NaNidx,:) = [];
This is not ideal, especially in my case, since it leaves me with X=[] and Y=[] (i.e. no observations!)
In this case:
1) Would replacing all NaNs with an arbitrary placeholder value, e.g. 99999, help? By doing this, I am introducing a new feature state for all the predictor features, so I guess it is not ideal.
2) Or is replacing NaNs with the mode of the corresponding feature column vector (as below) statistically more sound? (I am not vectorising, for clarity's sake.)
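function [matrixdata] = replaceNaNswithModes(matrixdata)
% Replace the NaNs in each column with that column's mode
for i = 1:size(matrixdata,2)
    cv = matrixdata(:,i);
    modevalue = mode(cv);        % mode ignores NaN values
    cv(isnan(cv)) = modevalue;   % logical indexing, find is not needed
    matrixdata(:,i) = cv;
end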
3) Or any other sensible way that would make sense for "categorical" data?
P.S: This link gives possible ways to handle missing data.
Answer 1:
I suggest using a table instead of a matrix. Then you have functions such as ismissing (for the entire table) and isundefined (for categorical variables) to deal with missing values.
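T = array2table(matrix);       % variables are named matrix1, matrix2, ...
% T = standardizeMissing(T, indicator);
% Optional: NaN is already the standard missing value for double, but this
% can be useful for other data types (indicator = the value to treat as missing)
var1 = categorical(T.matrix1); % NaN becomes <undefined>
rowMissing = isundefined(var1);
T = T(~rowMissing,:);          % removes rows where the first variable is NaN
matrix = table2array(T);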
Answer 2:
For a start, neither solution (1) nor (2) helps you handle your data more properly, since NaN is in fact a label that MATLAB handles appropriately; warnings will be issued. What you should do is:
- Handle the NaNs per case
- Use try catch blocks
NaN is like a number, and there is nothing bad about it. Even if you divide by NaN, MATLAB will treat it properly and give you a NaN.
If you still want to replace them, then you will need an assumption that holds. For example, if your data is a time series of engine speeds entered by the engine operator, and some time instances have not been specified, then there is more than one way to handle the NaNs that will appear in the matrix.
- Replace with 0s
- Replace with the previous value
- Replace with the next value
- Replace with the average of the previous and the next value and many more.
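The replacement strategies listed above can be sketched with fillmissing (available in newer MATLAB releases). The engine-speed values below are made up for illustration, and treating 'linear' as "average of the previous and the next value" only holds for isolated gaps in evenly spaced samples.

```matlab
% Hypothetical engine-speed time series with two unrecorded samples
speed = [1000; NaN; 1200; 1300; NaN; 1500];

withZeros  = fillmissing(speed, 'constant', 0);  % replace with 0s
withPrev   = fillmissing(speed, 'previous');     % replace with the previous value
withNext   = fillmissing(speed, 'next');         % replace with the next value
withLinear = fillmissing(speed, 'linear');       % linear interpolation between neighbours
```

Which of these is appropriate depends entirely on what the missing values mean for the data at hand.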
As you can see your problem is ill-posed, and depends on the predictor and the data source.
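In the case of categorical data, e.g. three categories {0,1,2}, and supposing NaN occurs in Y:
for k = 1:size(Y,2)
    id = isnan(Y(:,k));        % rows of column k that are missing
    m(k) = median(Y(~id,k));   % median of the observed values in column k
    Y(id,k) = round(m(k));     % round so the fill value is a valid category
end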
I feel really bad that I had to write a for-loop but I cannot see any other way. As you can see, I made a number of assumptions by using median and round. You may want to use a threshold depending on your knowledge about the data.
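For what it is worth, the same per-column median fill can probably be written without a loop. This is only a sketch: it assumes a release where median accepts the 'omitnan' flag, and the small matrix Y is made up for illustration.

```matlab
% Hypothetical data: categories {0,1,2}, NaN marks a missing value
Y = [0   1;
     NaN 2;
     2   NaN;
     2   2];

m = round(median(Y, 1, 'omitnan'));    % per-column median of the observed values
fillVals = repmat(m, size(Y,1), 1);    % one candidate fill value per entry
idx = isnan(Y);                        % mask of the missing entries
Y(idx) = fillVals(idx);                % overwrite only the missing entries
```

For nominal categories with no natural ordering, swapping the median for mode(Y,1) may be the more defensible assumption.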
Answer 3:
I think the answer to this has been given by gd047 in dimension-reduction-in-categorical-data-with-missing-values:
I am going to look into this. If anyone has any other suggestions or particular MATLAB implementations, it would be great to hear them.
Answer 4:
Take a look at http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html (the first dataset, a1a): it describes transforming categorical features into binary ones. That could possibly work. (:
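A rough sketch of that idea for a single feature column follows. The values in x are made up, and encoding a missing value as an all-zero row is an assumption of this sketch, not something the linked page prescribes.

```matlab
% Hypothetical categorical feature with one missing value
x = [1; 2; NaN; 3; 1];

cats = unique(x(~isnan(x)));        % observed categories: 1, 2, 3
B = double(bsxfun(@eq, x, cats'));  % one binary column per category
% NaN == c is false for every c, so the missing row becomes [0 0 0]
% and no observation has to be dropped.
```

Whether an all-zero row or an extra "missing" indicator column is preferable depends on how the downstream method treats the binary features.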
Source: https://stackoverflow.com/questions/9569886/matlab-missing-data-handling-in-categorical-data