Using Scikit's LabelEncoder correctly across multiple programs

无人共我 2020-12-02 17:32

The basic task I have at hand is:

a) Read some tab-separated data.

b) Do some basic preprocessing.

c) For each categorical column use LabelE

5 Answers
  •  盖世英雄少女心
    2020-12-02 18:00

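    According to the LabelEncoder implementation, the pipeline you've described will work correctly if and only if the LabelEncoders you fit at test time see data with exactly the same set of unique values as the training data. Each label's code is its position in the sorted array of unique values seen during fit, so a missing or extra value shifts the codes.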
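To illustrate the point above, here is a minimal sketch of how two independently fitted encoders can disagree. The label values and variable names are made up for illustration and are not from the question.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical data: the test column happens to lack the value "cat"
train_labels = ["cat", "dog", "fish"]
test_labels = ["dog", "fish"]

train_encoder = LabelEncoder().fit(train_labels)
test_encoder = LabelEncoder().fit(test_labels)

# Each code is the label's index in the sorted classes_ array,
# so the same label can map to different integers in each encoder.
train_code_for_dog = int(train_encoder.transform(["dog"])[0])
test_code_for_dog = int(test_encoder.transform(["dog"])[0])

print(train_code_for_dog, test_code_for_dog)
```

Here "dog" is encoded differently by the two encoders, which is why a model trained on the first encoding would misread data prepared with the second.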

    There's a somewhat hacky way to reuse the LabelEncoders you fitted during training. A fitted LabelEncoder has only one learned attribute, namely classes_. You can save it to disk and then restore it like this:

    Train:

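    import numpy
    from sklearn.preprocessing import LabelEncoder

    encoder = LabelEncoder()
    encoder.fit(X)  # X: the values of one categorical training column
    numpy.save('classes.npy', encoder.classes_)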
    

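    Test:

    import numpy
    from sklearn.preprocessing import LabelEncoder

    encoder = LabelEncoder()
    # allow_pickle=True is required if classes_ was saved as an object array (e.g. Python strings)
    encoder.classes_ = numpy.load('classes.npy', allow_pickle=True)
    # Now you should be able to use encoder as you would after `fit`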
    

    This seems more efficient than refitting the encoder on the same data.
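As a rough end-to-end check of the save-and-restore approach, the sketch below runs both halves in one script and compares the codes. The column values, the temporary directory, and the variable names are assumptions made for the example; in practice the two halves would live in separate programs sharing a real file path.

```python
import os
import tempfile

import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical categorical column stored as Python strings (object dtype)
train_column = np.array(["red", "green", "blue", "green"], dtype=object)

with tempfile.TemporaryDirectory() as tmp_dir:
    classes_path = os.path.join(tmp_dir, "classes.npy")

    # "Training program": fit and persist the learned classes
    train_encoder = LabelEncoder().fit(train_column)
    np.save(classes_path, train_encoder.classes_)

    # "Test program": rebuild the encoder without refitting
    test_encoder = LabelEncoder()
    test_encoder.classes_ = np.load(classes_path, allow_pickle=True)

# Both encoders should now assign identical codes
original_codes = train_encoder.transform(["green", "red"])
restored_codes = test_encoder.transform(["green", "red"])
print(original_codes, restored_codes)
```

Note that the restored encoder still raises a ValueError in transform for labels that were not present during training, so unseen test values need to be handled separately.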
