How to add another feature (length of text) to current bag of words classification? Scikit-learn

故里飘歌 2020-12-05 14:48

I am using bag of words to classify text. It's working well, but I am wondering how to add a feature that is not a word (for example, the length of the text).

Here is my sample code.

im         


        
2 answers
  •  甜味超标
    2020-12-05 15:02

    As shown in the comments, this is a combination of a FunctionTransformer, a Pipeline and a FeatureUnion.

    import numpy as np
    from sklearn.pipeline import Pipeline, FeatureUnion
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.svm import LinearSVC
    from sklearn.feature_extraction.text import TfidfTransformer
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.preprocessing import FunctionTransformer
    
    X_train = np.array(["new york is a hell of a town",
                        "new york was originally dutch",
                        "new york is also called the big apple",
                        "nyc is nice",
                        "the capital of great britain is london. london is a huge metropolis which has a great many number of people living in it. london is also a very old town with a rich and vibrant cultural history.",
                        "london is in the uk. they speak english there. london is a sprawling big city where it's super easy to get lost and i've got lost many times.",
                        "london is in england, which is a part of great britain. some cool things to check out in london are the museum and buckingham palace.",
                        "london is in great britain. it rains a lot in britain and london's fogs are a constant theme in books based in london, such as sherlock holmes. the weather is really bad there.",])
    # One label per document, as a 1-D array (avoids column-vector warnings)
    y_train = np.array([0, 0, 0, 0, 1, 1, 1, 1])
    
    X_test = np.array(["it's a nice day in nyc",
                       'i loved the time i spent in london, the weather was great, though there was a nip in the air and i had to wear a jacket.'
                       ])   
    target_names = ['Class 1', 'Class 2']
    
    
    def get_text_length(x):
        return np.array([len(t) for t in x]).reshape(-1, 1)
    
    classifier = Pipeline([
        ('features', FeatureUnion([
            ('text', Pipeline([
                ('vectorizer', CountVectorizer(min_df=1,max_df=2)),
                ('tfidf', TfidfTransformer()),
            ])),
            ('length', Pipeline([
                ('count', FunctionTransformer(get_text_length, validate=False)),
            ]))
        ])),
        ('clf', OneVsRestClassifier(LinearSVC()))])
    
    classifier.fit(X_train, y_train)
    predicted = classifier.predict(X_test)
    print(predicted)
    

    This appends the length of each text (in characters) as one extra feature column next to the tf-idf word features used by the classifier.
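
    To check what the classifier actually receives, the feature step can be run on its own. The sketch below is only an illustration: the three-document corpus `docs` and the variable names `features`, `X` and `n_words` are made up here and are not part of the answer above. It assumes the same `get_text_length` helper and shows that the FeatureUnion output has one column per vocabulary word plus one final column holding the length.

```python
import numpy as np
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.preprocessing import FunctionTransformer


def get_text_length(x):
    # One column holding the character count of each document.
    return np.array([len(t) for t in x]).reshape(-1, 1)


# Hypothetical toy corpus, used only to inspect the feature matrix.
docs = np.array(["nyc is nice",
                 "london is in the uk",
                 "new york was originally dutch"])

features = FeatureUnion([
    ('text', Pipeline([
        ('vectorizer', CountVectorizer()),
        ('tfidf', TfidfTransformer()),
    ])),
    ('length', FunctionTransformer(get_text_length, validate=False)),
])

# The union stacks the tf-idf columns and the length column side by side.
X = features.fit_transform(docs)

# Size of the vocabulary learned by the fitted CountVectorizer.
fitted_text = dict(features.transformer_list)['text']
n_words = len(fitted_text.named_steps['vectorizer'].vocabulary_)

print(X.shape)               # (number of documents, n_words + 1)
print(X.toarray()[:, -1])    # last column: text length of each document
```

    Note that the raw length is on a much larger scale than the tf-idf values, which are normalised per document by default. Depending on the data, it may be worth adding a scaling step to the 'length' pipeline before the classifier; whether that helps is something to verify on your own corpus.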
