Create a dictionary with 'word groups'


Question


I would like to do some text analysis on job descriptions and was going to use nltk. I can build a dictionary and remove the stopwords, which is part of what I want. However, in addition to the single words and their frequencies, I would like to keep meaningful 'word groups' and count them as well.

For example, in job descriptions containing 'machine learning' I don't want to consider 'machine' and 'learning' separately, but to retain the word group in my dictionary if it frequently occurs together. What is the most efficient method to do that? (I think I won't need to go beyond word groups containing 2 or 3 words.) And: at which point should I do the stopword removal?

Here is an example:

    text = ('As a Data Scientist, you will focus on machine '
            'learning and Natural Language Processing')

The dictionary I would like to have is:

     dict = ['data scientist', 'machine learning', 'natural language processing',
             'data', 'scientist', 'focus', 'machine', 'learning', 'natural',
             'language', 'processing']

Answer 1:


Sounds like what you want to do is use collocations from nltk.
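A minimal sketch of that approach, using the question's sample text (the frequency filter and the PMI measure are choices made for this sketch, not prescribed by the answer; on real data you would tune both):

import nltk
from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures

text = ('As a Data Scientist, you will focus on machine '
        'learning and Natural Language Processing')

tokens = nltk.word_tokenize(text.lower())  # needs nltk.download('punkt')
finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(1)  # on a real corpus, require e.g. 3+ occurrences
print(finder.nbest(BigramAssocMeasures.pmi, 5))  # 5 best bigrams by PMI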




Answer 2:


Tokenize your multi-word expressions into tuples, then put them in a set for easy lookup. The easiest way is to use nltk.ngrams, which lets you iterate directly over the ngrams in your text. Since your sample data includes a trigram, here's a search for n up to 3.

import nltk

raw_keywords = ['data scientist', 'machine learning', 'natural language processing',
                'data', 'scientist', 'focus', 'machine', 'learning', 'natural',
                'language', 'processing']
keywords = set(tuple(term.split()) for term in raw_keywords)

tokens = nltk.word_tokenize(text.lower())  # `text` is the sample string from the question
# Scan the text once for each ngram size.
for n in 1, 2, 3:
    for ngram in nltk.ngrams(tokens, n):
        if ngram in keywords:
            print(ngram)

If you have huge amounts of text, you could check whether you'll get a speed-up by iterating over maximal ngrams only (with the option pad_right=True to avoid missing the smaller ngram sizes at the end of the text). The number of lookups is the same both ways, so I doubt it will make much difference, except in the order of the returned results.

# Single scan over maximal ngrams (n = 3, as above); padding keeps the
# trailing unigrams and bigrams visible.
for ngram in nltk.ngrams(tokens, n, pad_right=True):
    for k in range(n):
        if ngram[:k+1] in keywords:
            print(ngram[:k+1])

As for stopword removal: if you remove them, you'll produce ngrams that were not there before, e.g., "sewing machine and learning center" will match "machine learning" after stopword removal. You'll have to decide whether that is something you want. If it were me, I would remove punctuation before the keyword scan, but leave the stopwords in place.
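To make that pitfall concrete, here's a small demonstration (a sketch; the phrase and variable names are just for illustration):

import nltk
from nltk.corpus import stopwords  # needs nltk.download('stopwords')

phrase = 'sewing machine and learning center'
stops = set(stopwords.words('english'))

tokens = nltk.word_tokenize(phrase)
filtered = [t for t in tokens if t not in stops]  # drops 'and'

print(('machine', 'learning') in set(nltk.ngrams(tokens, 2)))    # False
print(('machine', 'learning') in set(nltk.ngrams(filtered, 2)))  # True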




Answer 3:


Thanks @Batman, I played around a bit with collocations and ended up needing only a couple of lines of code. (Obviously the 'meaningful text' would have to be a lot longer to find actual collocations.)

import nltk
from nltk import word_tokenize
from nltk.collocations import *

meaningful_text = ('As a Data Scientist, you will focus on machine '
                   'learning and Natural Language Processing')

bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(word_tokenize(meaningful_text))
scored = finder.score_ngrams(bigram_measures.raw_freq)
print(sorted(scored, key=lambda s: s[1], reverse=True))
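Continuing from the snippet above, one way to build the combined word/word-group dictionary the question asks for (a sketch, not part of the original answer; the 0.05 cut-off is an arbitrary choice for this tiny text):

from collections import Counter

counts = Counter(word_tokenize(meaningful_text))  # single-word frequencies
for (w1, w2), score in scored:
    if score >= 0.05:  # keep only sufficiently frequent bigrams
        counts[w1 + ' ' + w2] = finder.ngram_fd[(w1, w2)]
print(counts)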


Source: https://stackoverflow.com/questions/42752356/create-a-dictionary-with-word-groups
