NLP process for combining common collocations

问题

I have a corpus that I'm using the tm package on in R (and also mirroring the same script in NLTK in python). I'm working with unigrams, but would like a parser of some kind to combine words commonly co-located to be as if one word---ie, I'd like to stop seeing "New" and "York" separately in my data set when they occur together, and see this particular pair represented as "New York" as if that were a single word, and alongside other unigrams.

What is this process called, of transforming meaningful, common n-grams onto the same footing as unigrams? Is it not a thing? Finally, what would the tm_map look like for this?

mydata.corpus <- tm_map(mydata.corpus, fancyfunction)

And/or in python?

回答1:

I recently had a similar question and played around with collocations

This was the solution I chose to identify pairs of collocated words:

from nltk import word_tokenize
from nltk.collocations import *

text = <a long text read in as string string>

tokenized_text = word_tokenize(text)

bigram_measures = nltk.collocations.BigramAssocMeasures(tokenized_text)
finder = BigramCollocationFinder.from_words()
scored = finder.score_ngrams(bigram_measures.raw_freq)

sorted(scored, key=lambda s: s[1], reverse=True)

来源：https://stackoverflow.com/questions/20710593/nlp-process-for-combining-common-collocations

标签

python

nlp

nltk

易学教程内所有资源均来自网络或用户发布的内容，如有违反法律规定的内容欢迎反馈！
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!