Question
I have to classify some sentiments. My data frame looks like this:
Phrase                      Sentiment
is it  good movie          positive
wooow is it very goode      positive
bad movie                  negative
I did some preprocessing (tokenisation, stop-word removal, stemming, etc.) and got:
Phrase                           Sentiment
[good, movie]                    positive
[wooow, is, it, very, good]      positive
[bad, movie]                     negative
Finally, I need a dataframe in which each row is a text, each column is a word, and each value is that word's tf-idf, like this:
good     movie    wooow    very     bad      Sentiment
tfidf    tfidf    tfidf    tfidf    tfidf    positive
(same thing for the 2 remaining lines)
Answer 1:
I'd use sklearn.feature_extraction.text.TfidfVectorizer, which is specifically designed for such tasks:
Demo:
In [63]: df
Out[63]:
                   Phrase Sentiment
0       is it  good movie  positive
1  wooow is it very goode  positive
2               bad movie  negative
Solution:
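import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
# max_df=0.5 drops terms found in more than half of the documents ("movie" here)
vect = TfidfVectorizer(sublinear_tf=True, max_df=0.5, analyzer='word', stop_words='english')
# fit on the raw phrases; pop() also removes the column from df
X = vect.fit_transform(df.pop('Phrase')).toarray()
r = df[['Sentiment']].copy()
del df
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
df = pd.DataFrame(X, columns=vect.get_feature_names_out())
del X
del vect
r.join(df)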
Result:
In [31]: r.join(df)
Out[31]:
  Sentiment  bad  good     goode     wooow
0  positive  0.0   1.0  0.000000  0.000000
1  positive  0.0   0.0  0.707107  0.707107
2  negative  1.0   0.0  0.000000  0.000000
UPDATE: memory saving solution:
from sklearn.feature_extraction.text import TfidfVectorizer
vect = TfidfVectorizer(sublinear_tf=True, max_df=0.5, analyzer='word', stop_words='english')
X = vect.fit_transform(df.pop('Phrase')).toarray()
# add one column per term to the existing frame instead of building a second one
# (get_feature_names() was removed in scikit-learn 1.2)
for i, col in enumerate(vect.get_feature_names_out()):
    df[col] = X[:, i]
UPDATE2: related question where the memory issue was finally solved
Answer 2:
Setup
import numpy as np
import pandas as pd
df = pd.DataFrame([
        [['good', 'movie'], 'positive'],
        [['wooow', 'is', 'it', 'very', 'good'], 'positive'],
        [['bad', 'movie'], 'negative']
    ], columns=['Phrase', 'Sentiment'])
df
                        Phrase Sentiment
0                [good, movie]  positive
1  [wooow, is, it, very, good]  positive
2                 [bad, movie]  negative
Calculating term frequency tf
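# use `value_counts` to get counts of items in each list
# (the top-level pd.value_counts is deprecated in recent pandas); sort columns alphabetically
tf = df.Phrase.apply(lambda words: pd.Series(words).value_counts()).fillna(0).sort_index(axis=1)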
print(tf)
   bad  good   is   it  movie  very  wooow
0  0.0   1.0  0.0  0.0    1.0   0.0    0.0
1  0.0   1.0  1.0  1.0    0.0   1.0    1.0
2  1.0   0.0  0.0  0.0    1.0   0.0    0.0
Calculating inverse document frequency idf
# add one to numerator and denominator in case a term isn't in any document
# maximum value is log(N + 1) (term in no document) and minimum value is zero (term in every document)
idf = np.log((len(df) + 1 ) / (tf.gt(0).sum() + 1))
idf
bad      0.693147
good     0.287682
is       0.693147
it       0.693147
movie    0.287682
very     0.693147
wooow    0.693147
dtype: float64
tfidf
tf * idf
        bad      good        is        it     movie      very     wooow
0  0.000000  0.287682  0.000000  0.000000  0.287682  0.000000  0.000000
1  0.000000  0.287682  0.693147  0.693147  0.000000  0.693147  0.693147
2  0.693147  0.000000  0.000000  0.000000  0.287682  0.000000  0.000000
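Part of the reason these numbers differ from TfidfVectorizer output is that scikit-learn, with its default settings, uses idf = ln((1 + N) / (1 + df)) + 1 and then scales each row to unit Euclidean length. A rough sketch of that extra step on the same token lists follows; the names idf_sk and tfidf_norm are made up here, and Sentiment is joined back on to get the layout asked for in the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([
        [['good', 'movie'], 'positive'],
        [['wooow', 'is', 'it', 'very', 'good'], 'positive'],
        [['bad', 'movie'], 'negative']
    ], columns=['Phrase', 'Sentiment'])

# term counts per document, one column per word
tf = (df['Phrase']
      .apply(lambda words: pd.Series(words).value_counts())
      .fillna(0)
      .sort_index(axis=1))

# smoothed idf with the extra +1 that scikit-learn applies by default
idf_sk = np.log((len(df) + 1) / (tf.gt(0).sum() + 1)) + 1

# weight the counts, then divide each row by its L2 norm
tfidf = tf * idf_sk
tfidf_norm = tfidf.div(np.sqrt((tfidf ** 2).sum(axis=1)), axis=0)

# attach the labels so each row is a phrase and each column a word
result = tfidf_norm.join(df['Sentiment'])
print(result.round(3))
```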
Source: https://stackoverflow.com/questions/41904197/data-frame-of-tfidf-with-python