Statistical language model: comparing word sequences of different lengths

Submitted by 两盒软妹~` on 2019-12-10 22:39:42

Question


I have an algorithm that extracts company names from text. It generally does a good job, however, it also sometimes extracts strings that look like company names, but obviously aren't. For example, "Contact Us", "Colorado Springs CO", "Cosmetic Dentist" are obviously not company names. There are too many of such false positives to blacklist, so I want to introduce an algorithmic way of ranking the extracted strings, so that the lowest-ranking ones can be discarded.

Currently, I'm thinking of using a statistical language model to do this. This model can score each string based on the product of the probabilities of each individual word in the string (considering the simplest unigram model). My question is: can such a model be used to compare word sequences of different lengths? Since probabilities are by definition less than 1, the probabilities of longer sequences are usually going to be smaller than those of shorter sequences. This would bias the model against longer sequences, which isn't a good thing.
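To make the length bias concrete, here is a minimal sketch of the unigram product score described above. The per-word probabilities are hypothetical values chosen for illustration, not taken from any real model:

```python
from functools import reduce

# Toy unigram probabilities (hypothetical values for illustration only).
unigram_p = {"nec": 1e-5, "corporation": 1e-4, "of": 0.03, "america": 1e-4}

def unigram_score(words):
    """Product of individual word probabilities (simplest unigram model).
    Unknown words get a small floor probability."""
    return reduce(lambda acc, w: acc * unigram_p.get(w, 1e-8), words, 1.0)

print(unigram_score(["nec"]))                                  # 1e-05
print(unigram_score(["nec", "corporation", "of", "america"]))  # 3e-15
```

Even though every word in "NEC Corporation of America" is plausible, its raw product is many orders of magnitude smaller than the score for "NEC" alone, simply because more factors below 1 are multiplied together.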

Is there a way to compare word sequences of different lengths using such statistical language models? Alternatively, is there a better way to score the sequences?

For example, with a bigram model and some existing data, this is what I get:

python slm.py About NEC
        <s> about 6
        about nec 1
        nec </s> 1
4.26701019773e-17
python slm.py NEC
        <s> nec 6
        nec </s> 1
2.21887517189e-11
python slm.py NEC Corporation
        <s> nec 6
        nec corporation 3
        corporation </s> 3593
4.59941029214e-13
python slm.py NEC Corporation of
        <s> nec 6
        nec corporation 3
        corporation of 41
        of </s> 1
1.00929844083e-20
python slm.py NEC Corporation of America
        <s> nec 6
        nec corporation 3
        corporation of 41
        of america 224
        america </s> 275
1.19561436587e-21

The indented lines show the bigrams and their frequency in the model. <s> and </s> are start and end of sentence, respectively. The problem is, the longer the sentence, the less probable it is, regardless of how often its constituent bigrams occur in the database.
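One common workaround (not part of the original question) is to compare the average log-probability per scored bigram, i.e. the log of the geometric mean of the bigram probabilities, instead of the raw product. This removes the fixed per-token penalty. A sketch applying it to the scores printed above, where the second argument is the number of bigrams scored for each string:

```python
import math

def per_token_logprob(seq_prob, n_bigrams):
    """Length-normalized score: log P(sequence) / n, the log of the
    geometric mean per-bigram probability."""
    return math.log(seq_prob) / n_bigrams

print(per_token_logprob(2.21887517189e-11, 2))  # "NEC"
print(per_token_logprob(4.59941029214e-13, 3))  # "NEC Corporation"
print(per_token_logprob(1.19561436587e-21, 5))  # "NEC Corporation of America"
```

Under this normalization, "NEC Corporation" and "NEC Corporation of America" both rank above the bare "NEC", so frequent constituent bigrams are no longer swamped by sequence length.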


Answer 1:


Can you normalize the scores based on sentence lengths, or use EM algorithm over unigram, bigram and trigram models?

Edit on 9/24:

There are probably a few alternatives you could try. One way is to make maximum-likelihood estimates on unigram, bigram and trigram models and take a linear interpolation (see: http://www.cs.columbia.edu/~mcollins/lm-spring2013.pdf). For each word at position i, you can determine whether (i+1) is the end of the sentence, or which word would most likely appear there. This method requires you to set up training and testing data sets to evaluate the performance (perplexity).
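A minimal sketch of the linear interpolation mentioned above. The function name, the toy MLE tables, and the lambda weights are all illustrative; in practice the weights would be tuned on held-out data (e.g. via EM), as the linked notes describe:

```python
def interp_prob(w, prev, prev2, p_uni, p_bi, p_tri, lambdas=(0.1, 0.3, 0.6)):
    """Linearly interpolated trigram probability:
    P(w | prev2, prev) = l1*P_uni(w) + l2*P_bi(w|prev) + l3*P_tri(w|prev2,prev).
    The lambdas are non-negative and sum to 1."""
    l1, l2, l3 = lambdas
    return (l1 * p_uni.get(w, 0.0)
            + l2 * p_bi.get((prev, w), 0.0)
            + l3 * p_tri.get((prev2, prev, w), 0.0))

# Toy maximum-likelihood estimates (hypothetical counts, illustration only).
p_uni = {"corporation": 0.001}
p_bi = {("nec", "corporation"): 0.5}
p_tri = {("<s>", "nec", "corporation"): 0.5}

print(interp_prob("corporation", "nec", "<s>", p_uni, p_bi, p_tri))  # 0.4501
```

Because the lower-order models back off gracefully, an unseen trigram no longer drives the whole probability to zero, which helps when scoring rare company names.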

I would avoid simply multiplying the probabilities of each individual word: words are not independent, so, for example, P(NEC, Corporation) != P(NEC) * P(Corporation).



Source: https://stackoverflow.com/questions/18928243/statistical-language-model-comparing-word-sequences-of-different-lengths
