How to find ngram frequency of a column in a pandas dataframe?

℡╲_俬逩灬. 提交于 2019-11-28 23:23:46

问题


Below is the input pandas dataframe I have.

I want to find the frequency of unigrams & bigrams. A sample of what I am expecting is shown below

How to do this using nltk or scikit learn?

I wrote the below code which takes a string as input. How to extend it to series/dataframe?

from nltk.collocations import *
desc='john is a guy person you him guy person you him'
tokens = nltk.word_tokenize(desc)
bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)
finder.ngram_fd.viewitems()

回答1:


If your data is like

import pandas as pd
df = pd.DataFrame([
    'must watch. Good acting',
    'average movie. Bad acting',
    'good movie. Good acting',
    'pathetic. Avoid',
    'avoid'], columns=['description'])

You could use the CountVectorizer of the package sklearn:

from sklearn.feature_extraction.text import CountVectorizer
word_vectorizer = CountVectorizer(ngram_range=(1,2), analyzer='word')
sparse_matrix = word_vectorizer.fit_transform(df['description'])
frequencies = sum(sparse_matrix).toarray()[0]
pd.DataFrame(frequencies, index=word_vectorizer.get_feature_names(), columns=['frequency'])

Which gives you :

                frequency
good            3
pathetic        1
average movie   1
movie bad       2
watch           1
good movie      1
watch good      3
good acting     2
must            1
movie good      2
pathetic avoid  1
bad acting      1
average         1
must watch      1
acting          1
bad             1
movie           1
avoid           1

EDIT

fit will just "train" your vectorizer : it will split the words of your corpus and create a vocabulary with it. Then transform can take a new document and create vector of frequency based on the vectorizer vocabulary.

Here your training set is your output set, so you can do both at the same time (fit_transform). Because you have 5 documents, it will create 5 vectors as a matrix. You want a global vector, so you have to make a sum.

EDIT 2

For big dataframes, you can speed up the frequencies computation by using:

frequencies = sum(sparse_matrix).data


来源:https://stackoverflow.com/questions/36572221/how-to-find-ngram-frequency-of-a-column-in-a-pandas-dataframe

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!