How to search a corpus to find frequency of a string?

Question


I'm working on an NLP project and I'd like to search through a corpus of text to try to find the frequency of a given verb-object pair.

The aim would be to find which verb-object pair is most likely when given a few different possibilities. For example, if given the strings "Swing the stick" and "Eat the stick" I would hope that the corpus would show it's much more likely for someone to swing a stick than eat one.

I've been reading about n-grams and corpus linguistics, but I'm struggling to find a way of performing this type of search in Java. Are there any APIs that might be useful?


Answer 1:


If you are looking for string correlations and frequencies, you might be able to make do with a very simple model using TF-IDF weights and cosine similarity: split your strings into small chunks and let each string represent a document.

In a nutshell, TF is term frequency: the number of times a word occurs in a given document. Taking your example and adding a bit more to it:

Doc1: swing the stick. eat the stick of carrot.

Doc2: swing the stick of gum.

The TF values are:

Doc1:

swing: 1
the: 2
stick: 2
eat: 1
of: 1
carrot: 1

Doc2:

swing: 1
the: 1
stick: 1
of: 1
gum: 1
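
As a rough sketch, the TF counts above can be produced with a few lines of plain Java (the tokenizer here is a simplifying assumption: lower-case the text and split on anything that is not a letter):

import java.util.HashMap;
import java.util.Map;

public class TermFrequency {
    // Count how many times each token occurs in one document.
    static Map<String, Integer> termFrequencies(String document) {
        Map<String, Integer> tf = new HashMap<>();
        // Naive tokenization: lower-case and split on non-letters.
        for (String token : document.toLowerCase().split("[^a-z]+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1, Integer::sum);
            }
        }
        return tf;
    }

    public static void main(String[] args) {
        // Prints {swing=1, the=2, stick=2, eat=1, of=1, carrot=1} (order may vary).
        System.out.println(termFrequencies("swing the stick. eat the stick of carrot."));
    }
}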

IDF is the inverse document frequency, which is based on how many documents a given word occurs in. This metric helps us remove the bias toward words like "the" and "of" that are very frequent but don't give us a lot of linguistic information.

Coming back to your example:

Doc1: swing the stick. eat the stick of carrot.

Doc2: swing the stick of gum.

The IDF values (here simply the number of documents each word occurs in, the same for all the documents) are:

swing: 2 (it occurs in 2 documents)
the: 2
stick: 2
eat: 1
of: 2
carrot: 1
gum:1
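
A matching sketch for these document counts, assuming each document has already been reduced to a TF map like the one above:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DocumentFrequency {
    // For each word, count how many documents contain it at least once.
    static Map<String, Integer> documentFrequencies(List<Map<String, Integer>> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (Map<String, Integer> doc : docs) {
            for (String word : doc.keySet()) {
                df.merge(word, 1, Integer::sum);
            }
        }
        return df;
    }
}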

Using these counts as the IDF weights (a real system would use the inverse of these counts so that frequent words are downweighted, see the note on log weighting below; the raw counts keep the arithmetic here simple), compute the TF * IDF value for each of the words in the document and build a vector to represent the document:

Doc1:

swing: (TF:1 * IDF:2) = 2
the: (TF:2 * IDF:2) = 4
stick: (TF:2 * IDF:2) = 4
eat: (TF:1 * IDF:1) = 1
of: (TF:1 * IDF:2) = 2
carrot: (TF:1 * IDF:1) = 1
gum: (TF:0 * IDF:1) = 0 (gum doesn't occur in Doc1, so TF = 0)

Doc2:

swing: (TF:1 * IDF:2) = 2
the: (TF:1 * IDF:2) = 2
stick: (TF:1 * IDF:2) = 2
eat: (TF:0 * IDF:1) = 0
of: (TF:1 * IDF:2) = 2
carrot: (TF:0 * IDF:1) = 0
gum: (TF:1 * IDF:1) = 1
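
Continuing the sketch, each document's vector can be built over the full vocabulary by multiplying TF by the chosen weight (here the raw document count, matching the toy numbers above, rather than a true inverse IDF):

import java.util.HashMap;
import java.util.Map;

public class TfIdfVector {
    // Build a weighted vector for one document over the whole vocabulary.
    // Words missing from the document get a 0 entry, as in the example above.
    static Map<String, Double> vectorize(Map<String, Integer> tf, Map<String, Integer> df) {
        Map<String, Double> vector = new HashMap<>();
        for (Map.Entry<String, Integer> entry : df.entrySet()) {
            String word = entry.getKey();
            double weight = entry.getValue(); // toy example: raw document count as the weight
            vector.put(word, tf.getOrDefault(word, 0) * weight);
        }
        return vector;
    }
}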

Now that you have a vector representing each of the documents, you can compute the similarity between them from the angle between the vectors, which you get from their dot product.

Doc1 . Doc2 (order does not matter):

swing: (doc1: 2 * doc2: 2) = 4
the: (doc1: 4 * doc2: 2) = 8
stick: (doc1: 4 * doc2: 2) = 8
eat: (doc1: 1 * doc2: 0) = 0
of: (doc1: 2 * doc2: 2) = 4
carrot: (doc1: 1 * doc2: 0) = 0
gum: (doc1: 0 * doc2: 1) = 0

Doc1 . Doc2 = 4 + 8 + 8 + 0 + 4 + 0 + 0 = 24

To turn the dot product into the cosine of the angle between the two vectors (the cosine similarity), divide it by the product of the two vectors' magnitudes, where each magnitude is the square root of the sum of squares of that vector's components:

|Doc1| = root(2^2 + 4^2 + 4^2 + 1^2 + 2^2 + 1^2 + 0^2) = root(42) ≈ 6.48
|Doc2| = root(2^2 + 2^2 + 2^2 + 0^2 + 2^2 + 0^2 + 1^2) = root(17) ≈ 4.12

cosine(Doc1, Doc2) = 24 / (6.48 * 4.12) ≈ 0.90

Once you have the cosine similarity between all your documents or strings, you can find out which ones are most similar and therefore have the highest probability of occurring together. The closer the similarity is to 1, the more alike the two strings are.
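
As a rough sketch of that last step in Java (assuming the two maps come from something like the vectorize method sketched above, so they cover the same vocabulary):

import java.util.Map;

public class CosineSimilarity {
    // cosine(a, b) = (a . b) / (|a| * |b|); values closer to 1 mean more similar.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
        }
        double normA = a.values().stream().mapToDouble(v -> v * v).sum();
        double normB = b.values().stream().mapToDouble(v -> v * v).sum();
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}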

In practice the IDF weight is the inverse of the raw document count (for example idf = log(N / df), where N is the total number of documents), and the TF and IDF scores are often converted to log values so that the downstream computations are easier and very frequent words are properly downweighted.
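
A minimal sketch of such a log-weighted score (this particular variant is just one of the standard schemes described in the IR book chapter linked below):

public class LogWeights {
    // One common TF-IDF weighting (others exist):
    // log-dampened term frequency times log inverse document frequency.
    static double tfIdf(int tf, int df, int totalDocs) {
        double tfWeight = tf > 0 ? 1 + Math.log(tf) : 0;       // dampened term frequency
        double idfWeight = Math.log((double) totalDocs / df);  // inverse document frequency
        return tfWeight * idfWeight;
    }
}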

There is an excellent tutorial in the Stanford Information Retrieval book (chapter 6) available here: http://nlp.stanford.edu/IR-book/

Additionally, there is some code in Perl and a quick-and-dirty explanation here: http://nlp-stuff.blogspot.com/2012/09/toy-example-for-computing-document.html and http://nlp-stuff.blogspot.com/2012/09/toy-example-for-computing-tfidf.html



Source: https://stackoverflow.com/questions/23030234/how-to-search-a-corpus-to-find-frequency-of-a-string
