What NLP tools to use to match phrases having similar meaning or semantics

浪尽此生 submitted on 2019-12-02 23:16:41
David Batista

If you have a large corpus in which these words occur, you can train a model that represents each word as a vector. For instance, you can use word2vec's skip-gram and CBOW models, which are implemented in the gensim software package.
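To make the difference between the two models concrete, here is a minimal pure-Python sketch of the training examples each one is built around. The function names `skipgram_pairs` and `cbow_examples` and the `window` parameter are made up for this illustration; gensim builds its examples internally and chooses between the two models via the `sg` argument of its `Word2Vec` class.

```python
def skipgram_pairs(tokens, window=2):
    """(center, context) pairs: skip-gram predicts each context word from the center word."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens, window=2):
    """(context words, center) examples: CBOW predicts the center word from its context."""
    examples = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, center))
    return examples


# Hypothetical one-sentence corpus, only to show the shape of the data.
sentence = "cheap health insurance plans".split()
print(skipgram_pairs(sentence, window=1))
print(cbow_examples(sentence, window=1))
```

Either way, words that keep appearing in the same contexts end up with similar vectors, which is what the similarity measures below rely on.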

In the word2vec model, each word is represented by a vector. You can then measure the semantic similarity between two words by computing the cosine of the angle between their vectors. Semantically similar words should have a high cosine similarity, for instance:

model.wv.similarity('cheap', 'inexpensive')  # e.g. 0.8

(The value is made up, just for illustration.)
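For reference, the cosine measure itself is only a few lines. The sketch below uses toy 3-dimensional vectors invented for illustration, not output from a trained model; real word2vec vectors typically have a hundred or more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors, made up for this example.
cheap = [0.9, 0.1, 0.3]
inexpensive = [0.8, 0.2, 0.4]
banana = [0.1, 0.9, 0.0]

# The first pair points in nearly the same direction, the second does not.
print(cosine_similarity(cheap, inexpensive))
print(cosine_similarity(cheap, banana))
```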

Also, from my experiments, summing the vectors of a relatively small number of words (up to 3 or 4) preserves the semantics, for instance:

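import numpy as np

# With gensim 4 and later, word vectors are accessed through model.wv
vector1 = model.wv['cheap'] + model.wv['health'] + model.wv['insurance']
vector2 = model.wv['low'] + model.wv['cost'] + model.wv['medical'] + model.wv['insurance']

# Cosine similarity of the two summed vectors, e.g. 0.7
np.dot(vector1, vector2) / (np.linalg.norm(vector1) * np.linalg.norm(vector2))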

(Again, just for illustration.)
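The same idea can be tried without a trained model. The sketch below sums word vectors into phrase vectors and compares them. The `toy_vectors` table and the helper `phrase_vector` are hypothetical, with values hand-picked so that related words point in similar directions, so the numbers only show the mechanics.

```python
import math

# Hypothetical 3-dimensional embeddings, hand-picked for illustration.
toy_vectors = {
    "cheap": [0.9, 0.1, 0.0],
    "low": [0.8, 0.2, 0.0],
    "cost": [0.7, 0.1, 0.1],
    "health": [0.1, 0.9, 0.1],
    "medical": [0.2, 0.8, 0.1],
    "insurance": [0.1, 0.2, 0.9],
    "fast": [-0.7, 0.3, -0.4],
    "sports": [-0.8, 0.1, -0.3],
    "car": [-0.6, -0.5, 0.2],
}


def phrase_vector(phrase, vectors):
    """Element-wise sum of the vectors of the words in the phrase."""
    words = phrase.split()
    return [sum(vectors[w][i] for w in words) for i in range(3)]


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


v1 = phrase_vector("cheap health insurance", toy_vectors)
v2 = phrase_vector("low cost medical insurance", toy_vectors)
v3 = phrase_vector("fast sports car", toy_vectors)

print(cosine_similarity(v1, v2))  # related phrases: close to 1
print(cosine_similarity(v1, v3))  # unrelated phrase: much lower
```

Because cosine similarity ignores vector length, summing and averaging the word vectors give the same result here.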

You can use this semantic similarity between words as the measure for generating your clusters.
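One simple way to turn the similarity measure into clusters is a greedy threshold pass, sketched below. This is only one possible option, not necessarily what was used in the experiments above; the function `greedy_cluster`, the `threshold` value, and the toy 2-dimensional vectors are all assumptions made for the example.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def greedy_cluster(items, vectors, threshold=0.8):
    """Put each item in the first cluster whose seed (first member) is similar enough."""
    clusters = []
    for item in items:
        for cluster in clusters:
            if cosine_similarity(vectors[item], vectors[cluster[0]]) >= threshold:
                cluster.append(item)
                break
        else:
            # No existing cluster was close enough, so this item seeds a new one.
            clusters.append([item])
    return clusters


# Toy vectors, made up for illustration.
toy_vectors = {
    "cheap": [0.9, 0.1],
    "inexpensive": [0.8, 0.2],
    "affordable": [0.85, 0.15],
    "luxury": [0.1, 0.9],
    "premium": [0.2, 0.8],
}

print(greedy_cluster(list(toy_vectors), toy_vectors))
```

The result depends on the order of the items and on the threshold, so for anything beyond a quick experiment a standard clustering algorithm run on the same vectors is probably a better fit.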

Gabriel

When Latent Semantic Analysis refers to a "document", it basically means any set of more than one word. You can use it to compute the similarity between a document and another document, between a word and another word, or between a word and a document. So you could certainly use it for your chosen application.
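As a rough illustration of the document-to-document case, the sketch below treats four short phrases as "documents", builds bag-of-words count vectors, and projects them onto the top two latent dimensions. To stay dependency-free it finds those dimensions with power iteration on the document Gram matrix; in practice you would use an SVD routine from a numerical library. The corpus and the helper `top_eigenpairs` are made up for the example.

```python
import math

docs = [
    "cheap health insurance",
    "low cost medical insurance",
    "fast sports car",
    "quick racing car",
]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


# Bag-of-words count vector for each document over the shared vocabulary.
vocab = sorted({w for d in docs for w in d.split()})
counts = [[d.split().count(w) for w in vocab] for d in docs]


def top_eigenpairs(matrix, k, iterations=500):
    """Top-k eigenpairs of a symmetric positive semi-definite matrix (power iteration + deflation)."""
    n = len(matrix)
    m = [row[:] for row in matrix]
    pairs = []
    for _ in range(k):
        v = [float(i + 1) for i in range(n)]  # arbitrary non-zero start vector
        for _step in range(iterations):
            w = [dot(row, v) for row in m]
            norm = math.sqrt(dot(w, w))
            v = [x / norm for x in w]
        lam = dot(v, [dot(row, v) for row in m])
        pairs.append((lam, v))
        # Deflate: remove the component just found so the next pass finds the next one.
        m = [[m[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]
    return pairs


# Gram matrix of the documents; its eigenvectors are the right singular vectors
# of the term-document matrix, and its eigenvalues are the squared singular values.
gram = [[dot(a, b) for b in counts] for a in counts]
pairs = top_eigenpairs(gram, k=2)

# Coordinate of document j along latent dimension i is sigma_i * v_i[j].
lsa_docs = [[math.sqrt(lam) * v[j] for lam, v in pairs] for j in range(len(docs))]

print(cosine(counts[0], counts[1]))      # raw word overlap: low, only "insurance" is shared
print(cosine(lsa_docs[0], lsa_docs[1]))  # in the latent space: close to 1
print(cosine(lsa_docs[0], lsa_docs[2]))  # insurance vs. car document: close to 0
```

In this toy corpus the two insurance phrases share only one word, yet they land on the same latent dimension, which is the effect LSA is used for. Word-to-word and word-to-document similarities can be computed in the same space from the corresponding term coordinates.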

Other algorithms that may be useful include:

I'd start by taking a look at WordNet. It will give you real synonyms and other word relations for hundreds of thousands of terms. Since you tagged the question with nltk: NLTK provides bindings for WordNet, and you can use it as the basis for domain-specific solutions.

Still in the NLTK, check out the discussion of the method similar() in the introduction to the NLTK book, and the class nltk.text.ContextIndex that it's based on. (All pretty simple still, but it might be all you really need.)
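The idea behind that kind of context-based similarity can be sketched in a few lines of plain Python: two words count as similar when they occur between the same left and right neighbors. This is a simplified stand-in written for illustration, not NLTK's actual implementation; the helpers `build_context_index` and `similar_words` and the boundary markers are made up here.

```python
from collections import defaultdict


def build_context_index(tokens):
    """Map each word to the set of (left neighbor, right neighbor) contexts it occurs in."""
    contexts = defaultdict(set)
    padded = ["<s>"] + tokens + ["</s>"]  # hypothetical sentence-boundary markers
    for i in range(1, len(padded) - 1):
        contexts[padded[i]].add((padded[i - 1], padded[i + 1]))
    return contexts


def similar_words(word, contexts):
    """Other words ranked by how many contexts they share with `word`."""
    target = contexts.get(word, set())
    shared = {}
    for other, ctx in contexts.items():
        overlap = len(target & ctx)
        if other != word and overlap:
            shared[other] = overlap
    return sorted(shared, key=lambda w: (-shared[w], w))


tokens = "we sell cheap insurance and we sell inexpensive insurance and we sell cars".split()
index = build_context_index(tokens)
print(similar_words("cheap", index))
```

On this toy text, "inexpensive" is the only word that shares a context ("sell _ insurance") with "cheap". On a real corpus you would want to lowercase the tokens and weight contexts by frequency.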
