How to reverse sklearn.OneHotEncoder transform to recover original data?

深忆病人 2020-12-13 07:34

I encoded my categorical data using sklearn.OneHotEncoder and fed them to a random forest classifier. Everything seems to work and I got my predicted output bac

8 answers
  • 2020-12-13 08:35

    Since version 0.20 of scikit-learn, the active_features_ attribute of the OneHotEncoder class has been deprecated, so I suggest relying on the categories_ attribute instead.

    The function below can help you recover the original data from a matrix that has been one-hot encoded:

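    import itertools


    def reverse_one_hot(X, y, encoder):
        """Rebuild one dict per row from a sparse one-hot matrix X and targets y."""
        reversed_data = [{} for _ in range(len(y))]
        # Flatten the per-feature category arrays into one list indexed by output column.
        all_categories = list(itertools.chain(*encoder.categories_))
        category_names = ['category_{}'.format(i + 1) for i in range(len(encoder.categories_))]
        category_lengths = [len(categories) for categories in encoder.categories_]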
    
        for row_index, feature_index in zip(*X.nonzero()):
            category_value = all_categories[feature_index]
            category_name = get_category_name(feature_index, category_names, category_lengths)
            reversed_data[row_index][category_name] = category_value
            reversed_data[row_index]['target'] = y[row_index]
    
        return reversed_data
    
    
    def get_category_name(index, names, lengths):
    
        counter = 0
        for i in range(len(lengths)):
            counter += lengths[i]
            if index < counter:
                return names[i]
        raise ValueError('The index is higher than the number of categorical values')
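
    The lookup in get_category_name relies on the encoder laying out its output columns feature by feature, so a flat column index can be mapped back with running totals of the category counts. A minimal standalone sketch of that idea follows; the `categories` list and the `locate` helper are illustrative assumptions, not part of scikit-learn:

```python
from bisect import bisect_right
from itertools import accumulate

# Hypothetical category layout: 4 users followed by 5 items, shaped like a
# fitted encoder's categories_ attribute (values are illustrative).
categories = [
    ["Diane", "Eric", "John", "Lucy"],
    ["Die Hard", "Forrest Gump", "The Matrix", "Titanic", "Wall-E"],
]

# Running totals of the block sizes: [4, 9]. Column c belongs to the first
# feature whose running total is greater than c.
boundaries = list(accumulate(len(block) for block in categories))


def locate(column):
    """Return (feature_index, category_value) for a flat one-hot column index."""
    feature = bisect_right(boundaries, column)
    if feature == len(categories):
        raise ValueError("column index exceeds the number of encoded columns")
    start = boundaries[feature - 1] if feature > 0 else 0
    return feature, categories[feature][column - start]


print(locate(2))  # (0, 'John')
print(locate(6))  # (1, 'The Matrix')
```

    This is the same boundary test as the loop above (index < counter), done with a binary search instead of a linear scan.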
    

    To test it, I have created a small data set containing the ratings that users have given to movies:

    data = [
        {'user_id': 'John', 'item_id': 'The Matrix', 'rating': 5},
        {'user_id': 'John', 'item_id': 'Titanic', 'rating': 1},
        {'user_id': 'John', 'item_id': 'Forrest Gump', 'rating': 2},
        {'user_id': 'John', 'item_id': 'Wall-E', 'rating': 2},
        {'user_id': 'Lucy', 'item_id': 'The Matrix', 'rating': 5},
        {'user_id': 'Lucy', 'item_id': 'Titanic', 'rating': 1},
        {'user_id': 'Lucy', 'item_id': 'Die Hard', 'rating': 5},
        {'user_id': 'Lucy', 'item_id': 'Forrest Gump', 'rating': 2},
        {'user_id': 'Lucy', 'item_id': 'Wall-E', 'rating': 2},
        {'user_id': 'Eric', 'item_id': 'The Matrix', 'rating': 2},
        {'user_id': 'Eric', 'item_id': 'Die Hard', 'rating': 3},
        {'user_id': 'Eric', 'item_id': 'Forrest Gump', 'rating': 5},
        {'user_id': 'Eric', 'item_id': 'Wall-E', 'rating': 4},
        {'user_id': 'Diane', 'item_id': 'The Matrix', 'rating': 4},
        {'user_id': 'Diane', 'item_id': 'Titanic', 'rating': 3},
        {'user_id': 'Diane', 'item_id': 'Die Hard', 'rating': 5},
        {'user_id': 'Diane', 'item_id': 'Forrest Gump', 'rating': 3},
    ]
    
    import pandas

    data_frame = pandas.DataFrame(data)
    data_frame = data_frame[['user_id', 'item_id', 'rating']]
    

    If we are building a prediction model, we have to remember to delete the dependent variable (in this case the rating) from the DataFrame before we encode it.

    ratings = data_frame['rating']
    data_frame.drop(columns=['rating'], inplace=True)
    

    Then we do the encoding:

    from sklearn.preprocessing import OneHotEncoder

    ohc = OneHotEncoder()
    encoded_data = ohc.fit_transform(data_frame)
    print(encoded_data)
    

    Which results in:

      (0, 2)    1.0
      (0, 6)    1.0
      (1, 2)    1.0
      (1, 7)    1.0
      (2, 2)    1.0
      (2, 5)    1.0
      (3, 2)    1.0
      (3, 8)    1.0
      (4, 3)    1.0
      (4, 6)    1.0
      (5, 3)    1.0
      (5, 7)    1.0
      (6, 3)    1.0
      (6, 4)    1.0
      (7, 3)    1.0
      (7, 5)    1.0
      (8, 3)    1.0
      (8, 8)    1.0
      (9, 1)    1.0
      (9, 6)    1.0
      (10, 1)   1.0
      (10, 4)   1.0
      (11, 1)   1.0
      (11, 5)   1.0
      (12, 1)   1.0
      (12, 8)   1.0
      (13, 0)   1.0
      (13, 6)   1.0
      (14, 0)   1.0
      (14, 7)   1.0
      (15, 0)   1.0
      (15, 4)   1.0
      (16, 0)   1.0
      (16, 5)   1.0
    

    After encoding, we can reverse the transformation using the reverse_one_hot function defined above, like this:

    reverse_data = reverse_one_hot(encoded_data, ratings, ohc)
    print(pandas.DataFrame(reverse_data))
    

    Which gives us:

       category_1    category_2  target
    0        John    The Matrix       5
    1        John       Titanic       1
    2        John  Forrest Gump       2
    3        John        Wall-E       2
    4        Lucy    The Matrix       5
    5        Lucy       Titanic       1
    6        Lucy      Die Hard       5
    7        Lucy  Forrest Gump       2
    8        Lucy        Wall-E       2
    9        Eric    The Matrix       2
    10       Eric      Die Hard       3
    11       Eric  Forrest Gump       5
    12       Eric        Wall-E       4
    13      Diane    The Matrix       4
    14      Diane       Titanic       3
    15      Diane      Die Hard       5
    16      Diane  Forrest Gump       3
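
    As an alternative to a hand-written helper, OneHotEncoder has also provided an inverse_transform method since scikit-learn 0.20. The sketch below assumes a recent scikit-learn install and uses a few rows from the ratings example above; the variable names are illustrative:

```python
from sklearn.preprocessing import OneHotEncoder

# A few (user, item) rows taken from the ratings example above.
rows = [
    ["John", "The Matrix"],
    ["Lucy", "Die Hard"],
    ["Eric", "Wall-E"],
    ["Diane", "Titanic"],
]

encoder = OneHotEncoder()
encoded = encoder.fit_transform(rows)  # sparse matrix, one column per category

# inverse_transform maps each one-hot row back to its original category values.
recovered = encoder.inverse_transform(encoded)
print(recovered.tolist())
```

    Unlike reverse_one_hot, this returns a plain array in the original column order and does not attach the target values, so the ratings would have to be re-joined separately.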
    
  • 2020-12-13 08:36

    How to one-hot encode

    See https://stackoverflow.com/a/42874726/562769

    import numpy as np
    nb_classes = 6
    orig_data = [[2, 3, 4, 0]]
    
    def indices_to_one_hot(data, nb_classes):
        """Convert an iterable of indices to one-hot encoded labels."""
        targets = np.array(data).reshape(-1)
        return np.eye(nb_classes)[targets]
    

    How to reverse

    def one_hot_to_indices(data):
        indices = []
        for el in data:
            indices.append(list(el).index(1))
        return indices
    
    
    hot = indices_to_one_hot(orig_data, nb_classes)
    indices = one_hot_to_indices(hot)
    
    print(orig_data)
    print(indices)
    

    gives:

    [[2, 3, 4, 0]]
    [2, 3, 4, 0]
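
    The reverse step can also be done without a Python loop. A minimal sketch, assuming every row contains exactly one 1 (the name `orig_indices` is illustrative):

```python
import numpy as np

nb_classes = 6
orig_indices = [2, 3, 4, 0]

# Same encoding idea as indices_to_one_hot: pick rows of the identity matrix.
hot = np.eye(nb_classes)[orig_indices]

# Each row holds a single 1, so the position of the row maximum is the class index.
indices = np.argmax(hot, axis=1).tolist()
print(indices)  # [2, 3, 4, 0]
```

    Note that argmax returns 0 for an all-zero row, whereas list(el).index(1) raises a ValueError, so the loop version is stricter about malformed input.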
    