I want to calculate tf-idf from the documents below. I'm using Python and pandas.
import pandas as pd
df = pd.DataFrame({'docId': [1, 2, 3],
                   'sent': ['This is the first sentence',
                            'This is the second sentence',
                            'This is the third sentence']})
I found a slightly different method using CountVectorizer from sklearn (credit for the count vectorizer/word frequency approach goes to Ultraviolet Analysis, and for preprocessing/cleaning text to Usman Malik's tweet-scraping tutorial). I won't be covering preprocessing in this answer. Basically, what you want to do is import CountVectorizer and fit your data to the CountVectorizer object, which will let you access the .vocabulary_.items() attribute. That gives you the vocabulary of your dataset: a mapping of each unique term to its feature index, subject to any limiting parameters you pass into CountVectorizer, such as max_features.
Then you're going to use TfidfTransformer to generate tf-idf weights for the terms in a similar manner.
I am coding in a Jupyter notebook using pandas and the PyCharm IDE.
Here is a code snippet:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
import numpy as np
#https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
countVec = CountVectorizer(max_features=5000, stop_words='english', min_df=0.01, max_df=0.90)
#%%
#use CountVectorizer.fit(raw_documents) to learn the vocabulary dictionary of all tokens in the raw documents
#the raw documents in this case will be tweetsFrameWords["Text"] (processed text)
countVec.fit(tweetsFrameWords["Text"])
#useful debug, get an idea of the item list you generated
list(countVec.vocabulary_.items())
#%%
#convert to bag of words
#transform() returns a sparse document-term matrix: one row per document, one column per vocabulary term
countVec_count = countVec.transform(tweetsFrameWords["Text"])
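#countVec_count is a scipy.sparse matrix that stores only the nonzero counts;
#a quick sanity check using standard scipy.sparse attributes:
print(countVec_count.shape)  #(number of documents, number of vocabulary terms)
print(countVec_count.nnz)    #number of nonzero entries actually stored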
#%%
#make array from number of occurrences
occ = np.asarray(countVec_count.sum(axis=0)).ravel().tolist()
#make a new data frame with columns 'term' and 'occurrences', i.e. each word and its number of occurrences
bowListFrame = pd.DataFrame({'term': countVec.get_feature_names_out(), 'occurrences': occ})
print(bowListFrame)
#sort by number of word occurrences, most->least; if you leave off the ascending flag, sort_values defaults to ascending order
bowListFrame.sort_values(by='occurrences', ascending=False).head(60)
#%%
#now, convert to a more useful ranking system, tf-idf weights
#TfidfTransformer: scale raw word counts to tf-idf weights, down-weighting terms that appear in many documents
#https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html
tweetTransformer = TfidfTransformer()
#fit the transformer on the raw counts and transform them into tf-idf weights
tweetWeights = tweetTransformer.fit_transform(countVec_count)
#follow a similar process to the occurrences data frame, but take each term's mean tf-idf weight across documents
tweetWeightsFin = np.asarray(tweetWeights.mean(axis=0)).ravel().tolist()
#now that we've computed tf-idf, make a dataframe with weights and terms
tweetWeightFrame = pd.DataFrame({'term': countVec.get_feature_names_out(), 'weight': tweetWeightsFin})
print(tweetWeightFrame)
tweetWeightFrame.sort_values(by='weight', ascending=False).head(20)
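If you want the raw counts and the tf-idf weights side by side, the two data frames share the 'term' column, so a plain pandas merge lines them up; a minimal sketch, assuming the bowListFrame and tweetWeightFrame objects built above:
#join raw counts and tf-idf weights on the shared 'term' column
combined = bowListFrame.merge(tweetWeightFrame, on='term')
combined.sort_values(by='weight', ascending=False).head(20)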
Scikit-learn's implementation is really easy:
from sklearn.feature_extraction.text import TfidfVectorizer
v = TfidfVectorizer()
x = v.fit_transform(df['sent'])
There are plenty of parameters you can specify. See the documentation: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
The output of fit_transform will be a sparse matrix; if you want to visualize it you can call x.toarray():
In [44]: x.toarray()
Out[44]:
array([[ 0.64612892, 0.38161415, 0. , 0.38161415, 0.38161415,
0. , 0.38161415],
[ 0. , 0.38161415, 0.64612892, 0.38161415, 0.38161415,
0. , 0.38161415],
[ 0. , 0.38161415, 0. , 0.38161415, 0.38161415,
0.64612892, 0.38161415]])
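To see which column of this matrix corresponds to which term, you can label the dense array with the vectorizer's vocabulary; a minimal sketch, assuming the v and x objects from above:
import pandas as pd
#one row per document, one column per vocabulary term
weights = pd.DataFrame(x.toarray(), columns=v.get_feature_names_out())
print(weights)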
A simple solution is to use texthero:
import texthero as hero
df['tfidf'] = hero.tfidf(df['sent'])
In [5]: df.head()
Out[5]:
docId sent tfidf
0 1 This is the first sentence [0.3816141458138271, 0.6461289150464732, 0.381...
1 2 This is the second sentence [0.3816141458138271, 0.0, 0.3816141458138271, ...
2 3 This is the third sentence [0.3816141458138271, 0.0, 0.3816141458138271, ...
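Since hero.tfidf stores each document's weights as a list inside a single column, you can expand that column into a document-by-term matrix with plain pandas; a minimal sketch, assuming the df from above (note the resulting columns are positional indices, not term labels):
#expand each per-document weight list into its own row of columns
tfidf_matrix = pd.DataFrame(df['tfidf'].tolist(), index=df['docId'])
print(tfidf_matrix)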