What I am striving to complete is a program which reads in a file and compares each sentence against the original sentence. The sentence which is a perfect match to
The task is called Paraphrase Identification, which is an active area of research in Natural Language Processing. I have linked several state-of-the-art papers, many of which have open-source code available on GitHub.
Note that all the existing answers assume there is some string/surface similarity between the two sentences, while in reality two sentences with little string similarity can still be semantically similar.
If you're interested in that kind of similarity, you can use Skip-Thoughts. Install the software according to the GitHub guide and follow the paraphrase-detection section of the README:
import skipthoughts
model = skipthoughts.load_model()
vectors = skipthoughts.encode(model, X_sentences)
This converts your sentences (X_sentences) to vectors. Later you can find the similarity of two vectors by:
import scipy.spatial.distance
similarity = 1 - scipy.spatial.distance.cosine(vectors[0], vectors[1])
where we assume vectors[0] and vectors[1] are the vectors corresponding to X_sentences[0] and X_sentences[1], the two sentences whose similarity you wanted to score.
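Putting the similarity step together, here is a minimal self-contained sketch. The vectors below are small stand-in NumPy arrays rather than real skip-thought encodings (which come from skipthoughts.encode and have hundreds of dimensions), but the cosine-similarity computation is identical:

```python
import numpy as np
import scipy.spatial.distance

# Stand-in embeddings; in practice these would come from skipthoughts.encode().
vec_a = np.array([0.1, 0.3, 0.5])
vec_b = np.array([0.2, 0.1, 0.4])

# Cosine similarity = 1 - cosine distance. It measures the angle between
# the vectors: values near 1 mean the sentences map to similar directions.
similarity = 1 - scipy.spatial.distance.cosine(vec_a, vec_b)
print(similarity)
```

Cosine similarity ignores vector magnitude, which is usually what you want for sentence embeddings, since only the direction encodes the meaning.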
There are other models for converting sentences to vectors, which you can find here.
Once you convert your sentences into vectors the similarity is just a matter of finding the Cosine similarity between those vectors.
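To make that point concrete, here is a plain-Python illustration of "vectorize, then take the cosine". It uses simple bag-of-words count vectors as a stand-in for learned embeddings, so it only captures surface word overlap, not semantics, but the mechanics are the same:

```python
import math
from collections import Counter

def bow_vector(sentence):
    # Bag-of-words counts; a semantic model like Skip-Thoughts would
    # produce a dense embedding instead, but the next step is the same.
    return Counter(sentence.lower().split())

def cosine_similarity(a, b):
    # Dot product over shared words, divided by the vector norms.
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

s1 = "the cat sat on the mat"
s2 = "a cat sat on a mat"
print(cosine_similarity(bow_vector(s1), bow_vector(s2)))
```

With semantic embeddings in place of `bow_vector`, two sentences with no words in common can still score highly, which is exactly what the bag-of-words version cannot do.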
Update in 2020: Google has released a new model called BERT, based on the deep learning framework TensorFlow. There is also an implementation many people find easier to use, called Transformers. These models accept two phrases or sentences and can be trained to say whether the two mean the same thing or not. To train them, you need a number of sentence pairs labelled 1 or 0 (depending on whether they have the same meaning or not). You train the model on your training data (already-labelled data), and then you can use the trained model to make predictions for a new pair of phrases/sentences. You can find how to train (they call it fine-tune) these models on their corresponding GitHub pages or in many other places such as this.
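The shape of the training data is easy to sketch. The snippet below shows labelled sentence pairs in the format these classifiers consume, with a toy word-overlap predictor standing in for the trained model (this is not BERT, just a hypothetical placeholder so the example runs; a fine-tuned model replaces the predict function with a learned one):

```python
# Paraphrase training data: pairs of sentences with a 1/0 label
# (1 = same meaning, 0 = different meaning), as in MRPC.
train_data = [
    ("He bought a car.", "He purchased a car.", 1),
    ("He bought a car.", "She sold her bike.", 0),
]

def predict(sentence1, sentence2, threshold=0.5):
    # Toy baseline: Jaccard word overlap against a threshold.
    # A fine-tuned BERT model would make a learned prediction here.
    a = set(sentence1.lower().split())
    b = set(sentence2.lower().split())
    jaccard = len(a & b) / len(a | b)
    return 1 if jaccard >= threshold else 0

for s1, s2, label in train_data:
    print(predict(s1, s2), label)
```

A baseline like this is only sensitive to shared words; the whole point of fine-tuning a model like BERT on such pairs is that it can label paraphrases that share little surface vocabulary.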
There is also already-labelled English training data available, called MRPC (the Microsoft Research Paraphrase Corpus). Note that multilingual and language-specific versions of BERT also exist, so this approach can be extended (i.e. trained) in other languages as well.