How to extract common / significant phrases from a series of text entries

Submitted by 夙愿已清 on 2019-11-28 02:35:13
dmcer

I suspect you don't just want the most common phrases, but rather you want the most interesting collocations. Otherwise, you could end up with an overrepresentation of phrases made up of common words and fewer interesting and informative phrases.

To do this, you'll essentially want to extract n-grams from your data and then find the ones that have the highest pointwise mutual information (PMI). That is, you want to find the words that co-occur together much more than you would expect them to by chance.
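
For reference (this is the standard definition, not something specific to NLTK), the PMI of a word pair (x, y) is

    \mathrm{PMI}(x, y) = \log \frac{P(x, y)}{P(x)\,P(y)}

so pairs whose joint probability is much higher than the product of their individual probabilities score highly, which is exactly the "more than you would expect by chance" intuition above.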

The NLTK collocations how-to covers how to do this in about seven lines of code, e.g.:

import nltk
from nltk.collocations import *
bigram_measures = nltk.collocations.BigramAssocMeasures()
trigram_measures = nltk.collocations.TrigramAssocMeasures()

# change this to read in your data
finder = BigramCollocationFinder.from_words(
    nltk.corpus.genesis.words('english-web.txt'))

# only bigrams that appear 3+ times
finder.apply_freq_filter(3)

# return the 10 n-grams with the highest PMI
finder.nbest(bigram_measures.pmi, 10)
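
The trigram_measures object above is created but never used; for completeness, here is a minimal sketch (my addition, same corpus and same pattern) of the equivalent trigram search:

# same idea for trigrams, reusing trigram_measures from above
finder3 = TrigramCollocationFinder.from_words(
    nltk.corpus.genesis.words('english-web.txt'))

# only trigrams that appear 3+ times
finder3.apply_freq_filter(3)

# return the 10 trigrams with the highest PMI
finder3.nbest(trigram_measures.pmi, 10)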

I think what you're looking for is chunking. I recommend reading chapter 7 of the NLTK book, or maybe my own article on chunk extraction. Both of these assume knowledge of part-of-speech tagging, which is covered in chapter 5.
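
As a rough illustration of the kind of chunking those chapters cover (my own sketch, not taken from either reference), a simple noun-phrase chunker with NLTK might look like this:

import nltk

# POS-tag a sentence, then chunk noun phrases with a regex grammar
sentence = "The quick brown fox jumped over the lazy dog"
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))

# NP = optional determiner, any number of adjectives, one or more nouns
chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}")
tree = chunker.parse(tagged)

# pull out the chunked noun phrases as plain strings
phrases = [" ".join(word for word, tag in subtree.leaves())
           for subtree in tree.subtrees()
           if subtree.label() == "NP"]
print(phrases)  # e.g. ['The quick brown fox', 'the lazy dog']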

If you just want n-grams larger than 3, you can try this. I'm assuming you've already stripped out all the junk like HTML.

import nltk

ngramlist = []
raw = <yourtextfile here>  # your cleaned-up text as a single string

x = 1
ngramlimit = 6
tokens = nltk.word_tokenize(raw)

# collect every n-gram from length 1 up to ngramlimit
while x <= ngramlimit:
    ngramlist.extend(nltk.ngrams(tokens, x))
    x += 1

Probably not very Pythonic, as I've only been doing this for a month or so myself, but it might be of help!
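
To then see which of those n-grams actually recur, counting them is straightforward; a small follow-on sketch, assuming the ngramlist built above:

from collections import Counter

# most frequent n-grams of any length up to ngramlimit
counts = Counter(ngramlist)
for ngram, freq in counts.most_common(20):
    print(freq, " ".join(ngram))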

Tomislav Nakic-Alfirevic

Well, for a start you would probably have to remove all HTML tags (search for "<[^>]*>" and replace it with ""). After that, you could try the naive approach of looking for the longest common substrings between every two text items, but I don't think you'd get very good results. You might do better by normalizing the words first (reducing them to their base form, removing all accents, setting everything to lower or upper case) and then analysing. Again, depending on what you want to accomplish, you might be able to cluster the text items better if you allow for some word-order flexibility, i.e. treat the text items as bags of normalized words and measure bag content similarity.
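
A rough sketch of that idea (my own addition, in Python since the other answers use it): strip tags, normalize to lowercase word sets, and compare the bags with Jaccard similarity:

import re

def normalize(text):
    # strip HTML tags, lowercase, and keep only word-like tokens
    text = re.sub(r"<[^>]*>", "", text)
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def bag_similarity(a, b):
    # Jaccard similarity between the two bags (here, sets) of normalized words
    bag_a, bag_b = normalize(a), normalize(b)
    if not bag_a or not bag_b:
        return 0.0
    return len(bag_a & bag_b) / len(bag_a | bag_b)

print(bag_similarity("<p>The quick brown fox</p>", "A quick brown dog"))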

I've commented on a similar (although not identical) topic here.
