Question
Here is an example showing that MATLAB's fitctree takes the feature order into account. Why?
load ionosphere % Contains X and Y variables
Mdl = fitctree(X,Y)
view(Mdl,'mode','graph');
X1=fliplr(X);
Mdl1 = fitctree(X1,Y)
view(Mdl1,'mode','graph');
These are not the same model, and thus not the same classification accuracy, despite being trained on the same features. Why?
Answer 1:
In your example, X contains 34 predictors. The predictors have no names, so fitctree just refers to them by their column numbers: x1, x2, ..., x34. If you flip the table, the column numbers change and therefore so do the names: x1 -> x34, x2 -> x33, etc.
For most nodes this does not matter, because CART always splits a node on the predictor that maximizes the impurity gain between the two child nodes. But sometimes several predictors yield exactly the same impurity gain; then it just picks the one with the lowest column number. And since the column numbers changed when you reordered the predictors, you end up with a different predictor at such a node, as the sketch below illustrates.
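A minimal, hypothetical sketch of that tie-breaking behaviour (the duplicated-column data, the MaxNumSplits setting and the commented outputs are illustrative assumptions, not taken from the original answer): columns 1 and 2 carry identical information, so any split on one of them gives the same impurity gain as the same split on the other, and only the reported name depends on the column order.

% Hypothetical tie-breaking demo: columns 1 and 2 are identical copies of
% the only informative predictor, so every candidate split on one of them
% has the same impurity gain as the same split on the other.
rng(0);                            % reproducibility
p  = randn(100,1);                 % informative predictor
Xt = [p, p, randn(100,1)];         % columns 1 and 2 are exact copies
Yt = double(p > 0);                % labels depend only on p
tie  = fitctree(Xt, Yt, 'MaxNumSplits', 1);
tie.CutPredictor{1}                % if ties go to the lower column number: 'x1'
tieF = fitctree(fliplr(Xt), Yt, 'MaxNumSplits', 1);
tieF.CutPredictor{1}               % after fliplr the copies sit in columns 2 and 3: 'x2'

Both trees describe the same split of the same data; only the reported predictor name differs.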
E.g. let's look at the marked split:
Original order (mdl): [tree view screenshot]
Flipped order (mdl1): [tree view screenshot]
Up to this point, the same predictors and cut values have been chosen in both trees; only the names changed due to the reordering, e.g. x5 in the original data = x30 in the new model. But x3 and x6 are actually different predictors: x6 in the flipped order is x29 in the original order.
A scatter plot between those two predictors shows how this can happen:
[scatter plot of predictor 3 vs. predictor 29]
The blue and cyan lines mark the splits performed by mdl and mdl1, respectively, at that node. As we can see, both splits yield child nodes with the same number of elements per label! Therefore CART can choose either of the two predictors; it will cause the same impurity gain.
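That equality can also be checked numerically. The following is only a sketch: it assumes a Gini-style impurity and reuses the node-selection index and cut values from the appendix below; it is not how fitctree computes the gain internally.

% Sketch: compare the Gini gain of the two competing splits at this node.
load ionosphere
idx  = (X(:,5)>=0.23154 & X(:,27)>=0.999945 & X(:,1)>=0.5);
node = X(idx,:);
lab  = Y(idx);
% Gini impurity of a label vector, and the gain of a split at threshold t
gini = @(y) 1 - sum((countcats(categorical(y)) ./ numel(y)).^2);
gain = @(x,t,y) gini(y) ...
    - (sum(x <  t)/numel(y)) * gini(y(x <  t)) ...
    - (sum(x >= t)/numel(y)) * gini(y(x >= t));
gain(node(:,3),  0.73,  lab)   % split used by mdl at this node
gain(node(:,29), 0.693, lab)   % split used by mdl1 at this node (named x6 there)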
In that case it seems to just pick the one with the lower column number. In the non-flipped table, x3 is chosen instead of x29 because 3 < 29. But if you flip the table, x3 becomes x32 and x29 becomes x6. Since 6 < 32, you now end up with x6, which is the original x29.
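For reference, the index mapping under fliplr can be verified directly: with 34 columns, column k of X becomes column 35-k of X1.

load ionosphere
X1 = fliplr(X);
isequal(X(:,29), X1(:,35-29))   % true: column 29 becomes column 6
isequal(X(:,3),  X1(:,35-3))    % true: column 3 becomes column 32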
Ultimately this does not matter: the decision tree of the flipped table is neither better nor worse. It only happens in the lower nodes, where the tree starts to overfit, so you really don't have to worry about it.
Appendix:
Code for scatter plot generation:
load ionosphere % Contains X (34 numeric predictors) and Y (class labels 'b'/'g')

% Fit one tree on the original predictor order and one on the flipped order
Mdl = fitctree(X,Y);
view(Mdl,'mode','graph');
X1=fliplr(X);
Mdl1 = fitctree(X1,Y);
view(Mdl1,'mode','graph');

% Observations that reach the node of the marked split
idx = (X(:,5)>=0.23154 & X(:,27)>=0.999945 & X(:,1)>=0.5);
remainder = X(idx,:);
labels = cell2mat(Y(idx,:));

% Predictor 3 vs. predictor 29 (column 35-6, i.e. x6 in the flipped table)
gscatter(remainder(:,3), remainder(:,(35-6)), labels,'rgb','osd');
limits = [-1.5 1.5];
xlim(limits)
ylim(limits)
xlabel('predictor 3')
ylabel('predictor 29')
hold on
% Blue: split on predictor 3 at 0.73 (Mdl); cyan: split on predictor 29 at 0.693 (Mdl1)
plot([0.73 0.73], limits, '-b')
plot(limits, [0.693 0.693], '-c')
legend({'b' 'g'})
Source: https://stackoverflow.com/questions/44543349/cart-algorithm-of-matlab-fitctree-takes-account-on-the-attributes-order-why