Calculating context-sensitive text correlation

春和景丽 2021-01-01 01:18

Suppose I want to match address records (or person names or whatever) against each other to merge records that are most likely referring to the same address. Basically, I gu

5 answers

刺人心 (original poster)

2021-01-01 02:01
Disclaimer: I don't know of any algorithm that does this, but I would really be interested in learning about one if it exists. This answer is a naive attempt at solving the problem, with no prior knowledge whatsoever. Comments welcome, please don't laugh too loud.

If you try doing it by hand, I would suggest applying some kind of "normalization" to your strings: lowercase them, remove punctuation, and maybe replace common abbreviations with the full words (Dr. => drive, St => street, etc.).
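A minimal sketch of that normalization step in Python might look like the following. The `normalize` function and the `ABBREVIATIONS` table are names made up for this illustration, and the table only holds a few sample entries. Note that some abbreviations are ambiguous ("st" can also mean "saint", "dr" can mean "doctor"), which this sketch ignores.

```python
import re

# Hypothetical abbreviation table (illustration only); extend it for your data.
ABBREVIATIONS = {"dr": "drive", "st": "street", "w": "west"}

def normalize(text):
    """Lowercase, strip punctuation, and expand known abbreviations."""
    lowered = text.lower()
    # Replace every character that is neither a word character nor whitespace.
    cleaned = re.sub(r"[^\w\s]", " ", lowered)
    # split() also collapses runs of whitespace left behind by the substitution.
    words = cleaned.split()
    return " ".join(ABBREVIATIONS.get(word, word) for word in words)

print(normalize("W. Lawnmower St."))
```

Expanding abbreviations word by word, after punctuation is gone, means "St." and "St" are treated the same way.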

Then you can try different alignments between the two strings you compare and, for each one, compute the correlation from the average absolute difference between corresponding letters (e.g. a = 1, b = 2, etc., so the difference between a and b is |a - b| = 1; the smaller the average difference, the higher the correlation):
```
west lawnmover drive
   w lawnmower street
```
Thus, even if some letters differ, the correlation stays high. Then simply keep the maximal correlation found over all alignments, and decide that the two strings are the same if it is above a given threshold.
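Here is a rough sketch of the alignment idea in Python, under a few assumptions of my own that the description above does not settle: the average letter difference is turned into a score between 0 and 1 via `1 - average / 26`, positions where only one string has a character are padded with spaces, and spaces, digits and anything else outside a-z count as 0. The function names and the 0.75 threshold are arbitrary choices for illustration.

```python
def letter_value(ch):
    # a = 1 ... z = 26; any other character (space, digit, ...) counts as 0.
    return ord(ch) - ord("a") + 1 if "a" <= ch <= "z" else 0

def similarity_at(a, b, offset):
    """Score in [0, 1] with b shifted right by `offset` (left if negative)."""
    if offset >= 0:
        a_pad, b_pad = a, " " * offset + b
    else:
        a_pad, b_pad = " " * (-offset) + a, b
    width = max(len(a_pad), len(b_pad))
    if width == 0:
        return 1.0  # two empty strings are trivially identical
    # Pad on the right so both strings cover the same positions.
    a_pad, b_pad = a_pad.ljust(width), b_pad.ljust(width)
    total = sum(abs(letter_value(x) - letter_value(y))
                for x, y in zip(a_pad, b_pad))
    # Assumed conversion: average difference 0 -> 1.0, average 26 -> 0.0.
    return 1.0 - total / (26 * width)

def best_similarity(a, b):
    """Try every alignment with at least one overlapping position."""
    if not a or not b:
        return 1.0 if a == b else 0.0
    return max(similarity_at(a, b, k) for k in range(-(len(b) - 1), len(a)))

def same_record(a, b, threshold=0.75):
    # The threshold is a guess; tune it on records you have labeled by hand.
    return best_similarity(a, b) >= threshold

print(best_similarity("west lawnmover drive", "w lawnmower street"))
```

Trying every offset costs time proportional to the product of the two string lengths, which is fine for short address lines but would get slow for long texts.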