Gensim Phrases usage to filter n-grams

Submitted by 久未见 on 2019-12-11 06:07:25

Question


I am using Gensim Phrases to identify important n-grams in my text as follows.

from gensim.models.phrases import Phrases

bigram = Phrases(documents, min_count=5)
trigram = Phrases(bigram[documents], min_count=5)

for sent in documents:
    bigrams_ = bigram[sent]
    trigrams_ = trigram[bigram[sent]]

However, this also detects uninteresting n-grams such as "special issue", "important matter", "high risk", etc. I am particularly interested in detecting concepts in the text, such as "machine learning" and "human computer interaction".

Is there a way to stop Phrases from detecting uninteresting n-grams like the ones mentioned in my example above?


Answer 1:


Phrases has a configurable threshold parameter which adjusts the statistical cutoff for promoting word-pairs into phrases. (Larger thresholds mean fewer pairs become phrases.)

You can adjust that to try to make a greater proportion of its promoted phrases match your own ad hoc intuition about "interesting" phrases – but this class is still using a fairly crude method, without any awareness of grammar or domain knowledge beyond what's in the corpus. So any value that gets all/most of the phrases you want will likely include many uninteresting ones, or vice-versa.
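To make the threshold's effect concrete, here is a pure-Python sketch of the scoring rule Phrases applies by default (gensim's "original" scorer); the counts below are made up for illustration:

```python
def phrase_score(count_a, count_b, count_ab, min_count, vocab_size):
    # gensim's default scorer: (count(a,b) - min_count) * |vocab| / (count(a) * count(b))
    # Higher scores mean a stronger collocation.
    return (count_ab - min_count) * vocab_size / (count_a * count_b)

# A pair seen together 50 times, built from moderately common words:
score = phrase_score(count_a=200, count_b=150, count_ab=50,
                     min_count=5, vocab_size=10_000)
# → 15.0
```

With the default threshold of 10.0 this pair would be promoted to a phrase; raising the threshold to 20 would reject it. Note this score depends only on co-occurrence counts, which is why frequent but uninteresting pairs can clear it.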

If you have a priori knowledge that certain word-groups are of importance, you could preprocess the corpus yourself to combine those into single tokens, before (or instead of) the collocation-statistics-based Phrases process.
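A minimal sketch of such a preprocessing step, assuming you supply a hand-curated set of known concepts (the `KNOWN_PHRASES` name and its contents are hypothetical):

```python
# Known multi-word concepts, stored as token tuples (hypothetical examples).
KNOWN_PHRASES = {("machine", "learning"), ("human", "computer", "interaction")}

def merge_known(tokens, phrases=KNOWN_PHRASES):
    """Greedily merge known word-groups into single underscore-joined tokens."""
    longest = max(len(p) for p in phrases)
    out, i = [], 0
    while i < len(tokens):
        # Try the longest candidate phrase first, down to bigrams.
        for n in range(longest, 1, -1):
            if tuple(tokens[i:i + n]) in phrases:
                out.append("_".join(tokens[i:i + n]))
                i += n
                break
        else:
            out.append(tokens[i])
            i += 1
    return out

merge_known(["we", "study", "machine", "learning", "daily"])
# → ["we", "study", "machine_learning", "daily"]
```

You would run this over each tokenized document before (or instead of) training Phrases, so your known concepts survive as single tokens regardless of their corpus statistics.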




Answer 2:


If I understand what you're trying to do, you could compare TF-IDF scores on your corpus against the TF-IDF scores from a larger, standard corpus (Wikipedia, say).

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=500, min_df=0.2,
                                   stop_words='english', use_idf=True,
                                   ngram_range=(1, 2))
X = tfidf_vectorizer.fit_transform(docs)  # fit, then map documents to their TF-IDF vectors

Look only at the n-grams whose values differ sharply between the two corpora. This will, of course, only work if you have a large enough number of documents.
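As a rough illustration of the idea — toy corpora and a hand-rolled TF-IDF instead of scikit-learn, with all names hypothetical:

```python
import math
from collections import Counter

def mean_tfidf(docs):
    """Mean TF-IDF weight per unigram/bigram over a small tokenized corpus."""
    def ngrams(tokens):
        return tokens + [" ".join(p) for p in zip(tokens, tokens[1:])]
    grams = [ngrams(d) for d in docs]
    n = len(grams)
    df = Counter(g for doc in grams for g in set(doc))  # document frequency
    totals = Counter()
    for doc in grams:
        for g, c in Counter(doc).items():
            totals[g] += (c / len(doc)) * math.log(n / df[g])  # tf * idf
    return {g: total / n for g, total in totals.items()}

domain = [["machine", "learning", "is", "fun"],
          ["machine", "learning", "models"],
          ["deep", "models", "work"]]
background = [["the", "weather", "is", "nice"],
              ["machine", "parts", "catalog"]]

bg = mean_tfidf(background)
diff = {g: s - bg.get(g, 0.0) for g, s in mean_tfidf(domain).items()}
# n-grams with the largest positive diff are candidate domain-specific concepts
```

With realistic corpora you would use TfidfVectorizer as above and subtract the background scores for the shared vocabulary; domain terms like "machine learning" rise to the top while generic n-grams cancel out.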



Source: https://stackoverflow.com/questions/47735393/gensim-phrases-usage-to-filter-n-grams
