Question
I would like to do some text analysis on job descriptions and was going to use nltk. I can build a dictionary and remove the stopwords, which is part of what I want. However in addition to the single words and their frequencies I would like to keep meaningful 'word groups' and count them as well.
For example, in job descriptions containing 'machine learning' I don't want to consider 'machine' and 'learning' separately, but retain the word group in my dictionary if it frequently occurs together. What is the most efficient method to do that? (I think I won't need to go beyond word groups containing 2 or 3 words.) And: at which point should I do the stopword removal?
Here is an example:
text = ('As a Data Scientist, you will focus on machine '
        'learning and Natural Language Processing')
The dictionary I would like to have is:
dict = ['data scientist', 'machine learning', 'natural language processing',
        'data', 'scientist', 'focus', 'machine', 'learning', 'natural',
        'language', 'processing']
Answer 1:
Sounds like what you want to do is use collocations from nltk.
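For instance, a minimal sketch of that approach (the corpus here is made up; on a real set of job descriptions you would feed in all the tokens at once):

import nltk
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

# Toy corpus; a real job-description corpus would be much larger.
tokens = nltk.word_tokenize('machine learning engineers apply machine learning daily')

finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(2)  # keep only bigrams that occur at least twice
# The five bigrams with the highest pointwise mutual information
print(finder.nbest(BigramAssocMeasures.pmi, 5))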
Answer 2:
Tokenize your multi-word expressions into tuples, then put them in a set for easy lookup. The easiest way is to use nltk.ngrams which allows you to iterate directly over the ngrams in your text. Since your sample data includes a trigram, here's a search for n up to 3.
import nltk

raw_keywords = ['data scientist', 'machine learning', 'natural language processing',
                'data', 'scientist', 'focus', 'machine', 'learning', 'natural',
                'language', 'processing']
keywords = set(tuple(term.split()) for term in raw_keywords)
tokens = nltk.word_tokenize(text.lower())

# Scan the text once for each ngram size.
for n in 1, 2, 3:
    for ngram in nltk.ngrams(tokens, n):
        if ngram in keywords:
            print(ngram)
If you have huge amounts of text, you could check whether you'll get a speed-up by iterating over maximal ngrams only (with the option pad_right=True to avoid missing the smaller ngram sizes at the end of the text). The number of lookups is the same both ways, so I doubt it will make much difference, except in the order of the returned results.
n = 3  # maximal ngram size; pad_right fills short tails with None, which never matches a keyword
for ngram in nltk.ngrams(tokens, n, pad_right=True):
    for k in range(n):
        if ngram[:k + 1] in keywords:
            print(ngram[:k + 1])
As for stopword removal: if you remove the stopwords, you'll produce ngrams where there were none before, e.g., "sewing machine and learning center" will match "machine learning" after stopword removal. You'll have to decide whether that is something you want. If it were me, I would remove punctuation before the keyword scan, but leave the stopwords in place, as sketched below.
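One simple way to do that punctuation filtering (a sketch; it drops tokens containing no alphanumeric character, which is one assumption about what counts as punctuation):

# Drop pure-punctuation tokens such as ',' before the keyword scan,
# while deliberately keeping stopwords in place.
tokens = [t for t in nltk.word_tokenize(text.lower())
          if any(c.isalnum() for c in t)]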
Answer 3:
Thanks @Batman, I played around a bit with collocations and ended up needing only a couple of lines of code. (Obviously meaningful_text should be a lot longer to find actual collocations.)
import nltk
from nltk import word_tokenize
from nltk.collocations import BigramCollocationFinder

meaningful_text = ('As a Data Scientist, you will focus on machine '
                   'learning and Natural Language Processing')

bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(word_tokenize(meaningful_text))
scored = finder.score_ngrams(bigram_measures.raw_freq)
print(sorted(scored, key=lambda s: s[1], reverse=True))
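On a larger corpus you would usually keep just the top-scoring pairs rather than the full scored list; finder.nbest does that directly (reusing finder and bigram_measures from above):

# The ten most frequent bigrams, highest raw frequency first
print(finder.nbest(bigram_measures.raw_freq, 10))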
Source: https://stackoverflow.com/questions/42752356/create-a-dictionary-with-word-groups