DictVectorizer for list as one feature in Python Pandas and Scikit-learn

执笔经年 2020-12-21 16:53

I have been trying to solve this for days. Although I found a similar problem here ("How can i vectorize list using sklearn DictVectorizer"), the solution is overly sim

1 Answer
  • 2020-12-21 17:38

    If I have understood correctly, you want a way to encode list values so that you end up with a feature dictionary that DictVectorizer can use. (One year too late, but) something like this can be used, depending on the case:

    my_dict_list = []
    
    for i in X:
        # create a new feature dictionary
        feat_dict = {}
        # add the features that are straightforward
        feat_dict['last-name'] = feature_full_last_name(i)
        feat_dict['dummy'] = 1
    
        # for the features that have a list of values iterate over the values and
        # create a custom feature for each value
        for two_letters in feature_twoLetters(feature_full_last_name(i)):
            # make sure the naming is unique enough so that no other feature
            # unrelated to this will have the same name/ key
            feat_dict['two-letter-substrings-' + two_letters] = True
    
        # save it to the list of feature dictionaries that DictVectorizer will consume
        my_dict_list.append(feat_dict)
    
    print(my_dict_list)
    
    from sklearn.feature_extraction import DictVectorizer
    dict_vect = DictVectorizer(sparse=False)
    transformed_x = dict_vect.fit_transform(my_dict_list)
    print(transformed_x)
    

    Output:

    [{'dummy': 1, u'two-letter-substrings-er': True, 'last-name': u'Anderson', u'two-letter-substrings-on': True, u'two-letter-substrings-de': True, u'two-letter-substrings-An': True, u'two-letter-substrings-rs': True, u'two-letter-substrings-nd': True, u'two-letter-substrings-so': True}, {'dummy': 1, u'two-letter-substrings-ee': True, u'two-letter-substrings-Le': True, 'last-name': u'Lee'}]
    [[ 1.  1.  0.  1.  0.  1.  0.  1.  1.  1.  1.  1.]
     [ 1.  0.  1.  0.  1.  0.  1.  0.  0.  0.  0.  0.]]
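
The snippet above depends on `X`, `feature_full_last_name` and `feature_twoLetters` from the question, which are not shown here. As a rough, self-contained sketch of the same idea, the version below swaps in a hypothetical helper `two_letter_substrings` (assumed to return every adjacent pair of letters) and an assumed two-name sample list, then prints each generated column name next to its values:

```python
from sklearn.feature_extraction import DictVectorizer


def two_letter_substrings(name):
    """Hypothetical stand-in for feature_twoLetters: all adjacent letter pairs."""
    return [name[i:i + 2] for i in range(len(name) - 1)]


def build_feature_dict(last_name):
    """Turn one last name into a flat feature dictionary."""
    feat_dict = {"last-name": last_name, "dummy": 1}
    # One boolean feature per list element, with a prefix to keep keys unique.
    for pair in two_letter_substrings(last_name):
        feat_dict["two-letter-substrings-" + pair] = True
    return feat_dict


last_names = ["Anderson", "Lee"]  # assumed sample data
feature_dicts = [build_feature_dict(name) for name in last_names]

dict_vect = DictVectorizer(sparse=False)
transformed_x = dict_vect.fit_transform(feature_dicts)

# Pair each column name with its values to see how the list was expanded.
for column_name, column in zip(dict_vect.feature_names_, transformed_x.T):
    print(column_name, column)
```

Printing `feature_names_` alongside the matrix makes it easier to see that the string-valued `last-name` feature is one-hot encoded per distinct name, while each two-letter substring gets its own column.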
    

    Another thing you could do (though I don't recommend it), if you don't want to create as many features as there are values in your lists, is something like this:

    # sorting the values would be a good idea
    feat_dict[frozenset(feature_twoLetters(feature_full_last_name(i)))] = True
    # or 
    feat_dict[" ".join(feature_twoLetters(feature_full_last_name(i)))] = True
    

    However, the first option (frozenset) cannot keep duplicate values, and neither option is likely to make a good feature, especially if you need fine-tuned, detailed ones. Both also make it less likely that two rows share a feature at all, because a row only matches another when its entire combination of two-letter substrings is identical, so the classification probably won't do well.

    Output:

    [{'dummy': 1, 'last-name': u'Anderson', frozenset([u'on', u'rs', u'de', u'nd', u'An', u'so', u'er']): True}, {'dummy': 1, 'last-name': u'Lee', frozenset([u'ee', u'Le']): True}]
    [{'dummy': 1, 'last-name': u'Anderson', u'An nd de er rs so on': True}, {'dummy': 1, u'Le ee': True, 'last-name': u'Lee'}]
    [[ 1.  0.  1.  1.  0.]
     [ 0.  1.  1.  0.  1.]]
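
To make the drawback of the single combined feature concrete, here is a small sketch that counts how many columns two similar names have in common under each encoding. The helper `two_letter_substrings`, the function `shared_columns` and the sample names are all assumptions made for illustration; only `DictVectorizer` comes from the answer.

```python
from sklearn.feature_extraction import DictVectorizer


def two_letter_substrings(name):
    # Hypothetical stand-in for feature_twoLetters from the question.
    return [name[i:i + 2] for i in range(len(name) - 1)]


def shared_columns(feature_dicts):
    """Count the columns that are non-zero in both of two rows."""
    matrix = DictVectorizer(sparse=False).fit_transform(feature_dicts)
    return int(((matrix[0] != 0) & (matrix[1] != 0)).sum())


names = ["Anderson", "Andersen"]  # assumed sample data

# One feature per list value: overlapping substrings land in the same column.
per_value = [
    {"two-letter-substrings-" + pair: True for pair in two_letter_substrings(n)}
    for n in names
]

# One feature for the whole list: any difference yields a different column.
joined = [{" ".join(sorted(two_letter_substrings(n))): True} for n in names]

shared_per_value = shared_columns(per_value)
shared_joined = shared_columns(joined)
print(shared_per_value, shared_joined)
```

With these two names, the per-value encoding leaves five columns in common (An, nd, de, er, rs), while the joined encoding leaves none, which is the loss of overlap the answer warns about.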
    