How can I find only 'interesting' words from a corpus?


面向向阳花 2020-12-25 08:44

I am parsing sentences. I want to know the relevant content of each sentence, defined loosely as "semi-unique words" in relation to the rest of the corpus. Something simil

4 answers

遥遥无期 (original poster)

2020-12-25 09:04

TF-IDF is one way to go. If you want to talk about sentences rather than words, in addition to the excellent references above, here's a simple scheme:

Create a Markov chain from a large sample corpus. In a nutshell, you construct a Markov chain by recording the frequency of every n-tuple of consecutive words in your input text. For example, the sentence "this is a test" with 3-tuples yields (this, is, a) and (is, a, test). Then group the n-tuples by their first n-1 terms, which lets you answer the question "given the preceding n-1 words, what is the probability that the next word is this one?"
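A minimal sketch of that construction in Python, assuming whitespace tokenization and raw counts with no smoothing. The names `build_chain` and `next_word_probability` are made up for illustration, not taken from any library:

```python
from collections import Counter, defaultdict

def build_chain(tokens, n=3):
    """Map each (n-1)-word prefix to a Counter of the words that follow it."""
    chain = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        # Split each n-tuple into its first n-1 words and its last word.
        *prefix, nxt = tokens[i:i + n]
        chain[tuple(prefix)][nxt] += 1
    return chain

def next_word_probability(chain, prefix, word):
    """Estimate P(word | prefix) from raw counts; 0.0 if the prefix was never seen."""
    followers = chain.get(tuple(prefix))
    if not followers:
        return 0.0
    return followers[word] / sum(followers.values())

# Toy corpus (an assumption for the example; a real corpus would be far larger).
tokens = "this is a test and this is a drill".split()
chain = build_chain(tokens, n=3)

p_test = next_word_probability(chain, ("is", "a"), "test")
p_a = next_word_probability(chain, ("this", "is"), "a")
```

In this toy corpus "is a" is followed once by "test" and once by "drill", so `p_test` is 0.5, while "this is" is always followed by "a", so `p_a` is 1.0.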

Now, for every sentence in the input document, traverse the Markov chain. Calculate the probability of seeing the sentence by multiplying together all the transition probabilities you encounter along the way. This gives you an estimate of how 'probable' the sentence is given the corpus. You may want to multiply this probability by the length of the sentence, since longer sentences multiply more factors below 1 and so come out less probable regardless of their content.
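Here is a rough sketch of the scoring step. It uses a small hand-written table of transition probabilities instead of a real corpus. The `UNSEEN` floor for transitions missing from the table is my own assumption (the scheme above does not say how to handle them), and summing logarithms is just a way to avoid underflow when multiplying many small numbers:

```python
import math

# Hypothetical chain for 3-tuples: two-word prefix -> {next word: probability}.
chain = {
    ("this", "is"): {"a": 0.8, "the": 0.2},
    ("is", "a"): {"test": 0.5, "drill": 0.5},
}

UNSEEN = 1e-6  # assumed floor probability for transitions not in the chain

def sentence_log_probability(chain, words, n=3):
    """Sum of log P(next word | previous n-1 words) across the sentence.

    Sentences shorter than n words have no transitions and score 0.0 (probability 1).
    """
    total = 0.0
    for i in range(len(words) - n + 1):
        prefix = tuple(words[i:i + n - 1])
        nxt = words[i + n - 1]
        total += math.log(chain.get(prefix, {}).get(nxt, UNSEEN))
    return total

def length_scaled_score(chain, words, n=3):
    """Sentence probability multiplied by its word count, as suggested above."""
    return len(words) * math.exp(sentence_log_probability(chain, words, n))

common = "this is a test".split()
rare = "this is the test".split()

common_score = length_scaled_score(chain, common)
rare_score = length_scaled_score(chain, rare)
```

With these numbers, "this is a test" gets 0.8 × 0.5 = 0.4 before scaling, while "this is the test" hits an unseen transition and ends up many orders of magnitude lower, which is what marks it as unusual.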

Each sentence in your input now has an associated probability. Pick the n least probable sentences: these are the 'interesting' ones, for some definition of interesting.
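The final selection could look something like this, assuming you already have a (sentence, probability) pair for every sentence. The sentences and scores below are invented for the example, and `least_probable` is a hypothetical helper name:

```python
import heapq

# Hypothetical scores; in practice they come from traversing the Markov chain.
scored = [
    ("the meeting is at noon", 0.0210),
    ("quantum ferrets annexed the spreadsheet", 0.0000004),
    ("see you tomorrow", 0.0350),
    ("the compiler unionized overnight", 0.0000090),
]

def least_probable(scored, count):
    """Return the `count` sentences with the lowest probability, lowest first."""
    lowest = heapq.nsmallest(count, scored, key=lambda pair: pair[1])
    return [sentence for sentence, _ in lowest]

interesting = least_probable(scored, 2)
```

`heapq.nsmallest` avoids sorting the whole list, which only matters for large inputs; `sorted(scored, key=...)[:count]` would give the same result.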
