Gensim Phrases usage to filter n-grams

Submitted by 久未见 on 2019-12-11 06:07:25

Question


I am using Gensim Phrases to identify important n-grams in my text as follows.

from gensim.models.phrases import Phrases

bigram = Phrases(documents, min_count=5)
trigram = Phrases(bigram[documents], min_count=5)

for sent in documents:
    bigrams_ = bigram[sent]
    trigrams_ = trigram[bigram[sent]]

However, this also detects uninteresting n-grams such as "special issue", "important matter", "high risk", etc. I am particularly interested in detecting concepts in the text, such as "machine learning" and "human computer interaction".

Is there a way to stop Phrases from detecting uninteresting n-grams like the ones mentioned in my example above?


Answer 1:


Phrases has a configurable threshold parameter which adjusts the statistical cutoff for promoting word-pairs into phrases. (Larger thresholds mean fewer pairs become phrases.)

You can adjust that to try to make a greater proportion of its promoted phrases match your own ad hoc intuition about "interesting" phrases – but this class is still using a fairly crude method, without any awareness of grammar or domain knowledge beyond what's in the corpus. So any value that gets all/most of the phrases you want will likely include many uninteresting ones, or vice-versa.
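To make the threshold's effect concrete, here is a pure-Python sketch of the scoring rule Phrases applies by default (gensim's "original" scorer); the counts below are made up for illustration:

```python
def phrase_score(count_a, count_b, count_ab, min_count, vocab_size):
    # gensim's default scorer: (count(a,b) - min_count) * |vocab| / (count(a) * count(b))
    # Higher scores mean a stronger collocation.
    return (count_ab - min_count) * vocab_size / (count_a * count_b)

# A pair seen together 50 times, built from moderately common words:
score = phrase_score(count_a=200, count_b=150, count_ab=50,
                     min_count=5, vocab_size=10_000)
# → 15.0
```

With the default threshold of 10.0 this pair would be promoted to a phrase; raising the threshold to 20 would reject it. Note this score depends only on co-occurrence counts, which is why frequent but uninteresting pairs can clear it.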

If you have a priori knowledge that certain word-groups are of importance, you could preprocess the corpus yourself to combine those into single tokens, before (or instead of) the collocation-statistics-based Phrases process.
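A minimal sketch of such a preprocessing step, assuming you supply a hand-curated set of known concepts (the `KNOWN_PHRASES` name and its contents are hypothetical):

```python
# Known multi-word concepts, stored as token tuples (hypothetical examples).
KNOWN_PHRASES = {("machine", "learning"), ("human", "computer", "interaction")}

def merge_known(tokens, phrases=KNOWN_PHRASES):
    """Greedily merge known word-groups into single underscore-joined tokens."""
    longest = max(len(p) for p in phrases)
    out, i = [], 0
    while i < len(tokens):
        # Try the longest candidate phrase first, down to bigrams.
        for n in range(longest, 1, -1):
            if tuple(tokens[i:i + n]) in phrases:
                out.append("_".join(tokens[i:i + n]))
                i += n
                break
        else:
            out.append(tokens[i])
            i += 1
    return out

merge_known(["we", "study", "machine", "learning", "daily"])
# → ["we", "study", "machine_learning", "daily"]
```

You would run this over each tokenized document before (or instead of) training Phrases, so your known concepts survive as single tokens regardless of their corpus statistics.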




Answer 2:


If I understand what you're trying to do, you could compare TF-IDF scores on your corpus against the TF-IDF scores from a larger, standard corpus (Wikipedia, say).

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=500, min_df=0.2,
                                   stop_words='english', use_idf=True,
                                   ngram_range=(1, 2))
X = tfidf_vectorizer.fit_transform(docs)  # fit, then map documents to their TF-IDF vectors

Look only at the n-grams whose values differ sharply between the two corpora. This will, of course, only work if you have a large enough number of documents.
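As a rough illustration of the idea — toy corpora and a hand-rolled TF-IDF instead of scikit-learn, with all names hypothetical:

```python
import math
from collections import Counter

def mean_tfidf(docs):
    """Mean TF-IDF weight per unigram/bigram over a small tokenized corpus."""
    def ngrams(tokens):
        return tokens + [" ".join(p) for p in zip(tokens, tokens[1:])]
    grams = [ngrams(d) for d in docs]
    n = len(grams)
    df = Counter(g for doc in grams for g in set(doc))  # document frequency
    totals = Counter()
    for doc in grams:
        for g, c in Counter(doc).items():
            totals[g] += (c / len(doc)) * math.log(n / df[g])  # tf * idf
    return {g: total / n for g, total in totals.items()}

domain = [["machine", "learning", "is", "fun"],
          ["machine", "learning", "models"],
          ["deep", "models", "work"]]
background = [["the", "weather", "is", "nice"],
              ["machine", "parts", "catalog"]]

bg = mean_tfidf(background)
diff = {g: s - bg.get(g, 0.0) for g, s in mean_tfidf(domain).items()}
# n-grams with the largest positive diff are candidate domain-specific concepts
```

With realistic corpora you would use TfidfVectorizer as above and subtract the background scores for the shared vocabulary; domain terms like "machine learning" rise to the top while generic n-grams cancel out.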



Source: https://stackoverflow.com/questions/47735393/gensim-phrases-usage-to-filter-n-grams
