Feature selection using logistic regression

Posted by 柔情痞子 on 2019-12-01 13:20:40

sklearn's GridSearchCV gives you a pretty neat way to search for the best feature set: you list the settings you want to try, and it cross-validates every combination. For example, consider the following code:

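from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# The step names ('vect', 'clf') become the prefixes of the grid keys below.
pipeline = Pipeline([
    ('vect', TfidfVectorizer(stop_words='english', sublinear_tf=True)),
    ('clf', LogisticRegression()),
])

# Candidate values to try for each step's parameters.
parameters = {
    'vect__max_df': (0.25, 0.5, 0.6, 0.7, 1.0),
    'vect__ngram_range': ((1, 1), (1, 2), (2, 3), (1, 3), (1, 4), (1, 5)),
    'vect__use_idf': (True, False),
    'clf__C': (0.1, 1, 10, 20, 30),
}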

Here the parameters dict holds all of the different settings that I need to consider. Notice the use of vect__max_df: max_df is an actual parameter of my vectorizer, which is my feature selector. So,

'vect__max_df': (0.25, 0.5, 0.6, 0.7, 1.0),

actually specifies that I want to try out the above 5 values for my vectorizer's max_df. The same goes for the other entries. Notice how I have tied my vectorizer to the name 'vect' and my classifier to the name 'clf' in the pipeline. Can you see the pattern? Each key is the step name, a double underscore, and then the parameter name.
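To make the naming pattern concrete, here is a minimal sketch that rebuilds the same two-step pipeline and sets two parameters by hand through their prefixed keys. The particular values (0.5 and 10) are picked only for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('vect', TfidfVectorizer(stop_words='english', sublinear_tf=True)),
    ('clf', LogisticRegression()),
])

# Every setting of every step is exposed as '<step name>__<parameter name>'.
all_params = pipeline.get_params()
print('vect__max_df' in all_params)
print('clf__C' in all_params)

# set_params routes each value to the step named by the prefix.
# The values here are arbitrary, chosen only for illustration.
pipeline.set_params(vect__max_df=0.5, clf__C=10)
print(pipeline.named_steps['vect'].max_df)
print(pipeline.named_steps['clf'].C)
```

GridSearchCV does this kind of routing for every combination in the grid, as in the script that follows.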

    import re

    import pandas as pd
    from nltk.stem import WordNetLemmatizer
    from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
    from sklearn.model_selection import GridSearchCV, train_test_split

    traindf = pd.read_json('../../data/train.json')

    traindf['ingredients_clean_string'] = [' , '.join(z).strip() for z in traindf['ingredients']]

    # Strip non-letters from each ingredient, lemmatize it, and join into one string per recipe.
    lemmatizer = WordNetLemmatizer()
    traindf['ingredients_string'] = [
        ' '.join(lemmatizer.lemmatize(re.sub('[^A-Za-z]', ' ', line)) for line in lists).strip()
        for lists in traindf['ingredients']
    ]

    # as_matrix() was removed in pandas 1.0; to_numpy() replaces it.
    X, y = traindf['ingredients_string'], traindf['cuisine'].to_numpy()

    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7)

    # Cross-validate every combination in the grid and keep the most accurate one.
    grid_search = GridSearchCV(pipeline, parameters, n_jobs=3, verbose=1, scoring='accuracy')
    grid_search.fit(X_train, y_train)

    print('best score: %0.3f' % grid_search.best_score_)
    print('best parameters set:')

    bestParameters = grid_search.best_estimator_.get_params()

    for param_name in sorted(parameters.keys()):
        print('\t %s: %r' % (param_name, bestParameters[param_name]))

    # Evaluate the refitted best pipeline on the held-out 30%.
    predictions = grid_search.predict(X_test)
    print('Accuracy:', accuracy_score(y_test, predictions))
    print('Confusion Matrix:', confusion_matrix(y_test, predictions))
    print('Classification Report:', classification_report(y_test, predictions))

Note that the loop over bestParameters prints the best-scoring value for each parameter, out of all the options that I specified in the parameters dict.
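To get a feel for how many options that is, the grid can be expanded by hand with the standard library. This is only a sketch of the bookkeeping, not how GridSearchCV is implemented, and the fold count is an assumption (5 is the default in recent scikit-learn releases; older ones used 3).

```python
from itertools import product

parameters = {
    'vect__max_df': (0.25, 0.5, 0.6, 0.7, 1.0),
    'vect__ngram_range': ((1, 1), (1, 2), (2, 3), (1, 3), (1, 4), (1, 5)),
    'vect__use_idf': (True, False),
    'clf__C': (0.1, 1, 10, 20, 30),
}

# Build one dict per combination of candidate values.
names = sorted(parameters)
candidates = [
    dict(zip(names, values))
    for values in product(*(parameters[name] for name in names))
]

cv_folds = 5  # assumption: the default number of folds
print(len(candidates))             # 5 * 6 * 2 * 5 combinations
print(len(candidates) * cv_folds)  # pipeline fits during the search
print(candidates[0])
```

So a grid that looks small on the page still means hundreds of fits, which is why n_jobs is worth setting.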

Hope this helps.

Edit: To get a list of features selected

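So once you have your best set of parameters, create the vectorizer and classifier with those parameter values. The winning values are stored in grid_search.best_params_, so for the vectorizer you can strip the 'vect__' prefix and pass them straight in:

best = {k[len('vect__'):]: v for k, v in grid_search.best_params_.items() if k.startswith('vect__')}
vect = TfidfVectorizer(stop_words='english', sublinear_tf=True, **best)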

Then you basically fit this vectorizer again. In doing so, the vectorizer builds its vocabulary, that is, its set of features, from your training set.

traindf = pd.read_json('../../data/train.json')

traindf['ingredients_clean_string'] = [' , '.join(z).strip() for z in traindf['ingredients']]

traindf['ingredients_string'] = [' '.join([WordNetLemmatizer().lemmatize(re.sub('[^A-Za-z]', ' ', line)) for line in lists]).strip() for lists in traindf['ingredients']]

# as_matrix() was removed in pandas 1.0; to_numpy() replaces it.
X, y = traindf['ingredients_string'], traindf['cuisine'].to_numpy()

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7)

# Learn the vocabulary on the training split and build the TF-IDF term-document matrix.
termDocMatrix = vect.fit_transform(X_train, y_train)
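What fitting the vectorizer does to the feature set can be seen on a toy corpus. The four documents below are made up for illustration, and max_df=0.5 is just one of the candidate values from the grid; the point is only that a term appearing in more than half of the documents is dropped from the vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up recipes: 'salt' appears in every document, all other terms in one.
docs = [
    'salt pepper olive oil',
    'salt soy sauce ginger',
    'salt butter flour sugar',
    'salt cumin coriander chili',
]

# max_df=0.5 drops terms found in more than 50% of the documents.
vect = TfidfVectorizer(max_df=0.5)
matrix = vect.fit_transform(docs)

# One column per surviving term.
print(matrix.shape)
print(sorted(vect.vocabulary_))
print('salt' in vect.vocabulary_)
```

The columns of the matrix line up with the vocabulary, which is what lets you map selected columns back to feature names afterwards.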

Now termDocMatrix has all of the features the vectorizer kept, and you can use the vectorizer to get the feature names. Let's say you want to get the top 100 features, and your metric for comparison is the chi-square score:

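from sklearn.feature_selection import SelectKBest, chi2
getKbest = SelectKBest(chi2, k=100).fit(termDocMatrix, y_train)

Note the fit call: the selector has to score the features against the labels before get_support() can be used. Now just

import numpy as np
print(np.asarray(vect.get_feature_names_out())[getKbest.get_support()])

should give you the 100 features with the highest chi-square scores. On scikit-learn versions before 1.0 the method is called get_feature_names() instead. Try this.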
