Calculating context-sensitive text correlation

春和景丽 2021-01-01 01:18

Suppose I want to match address records (or person names or whatever) against each other to merge records that are most likely referring to the same address. Basically, I gu

5 answers
  •  刺人心 (original poster)
    2021-01-01 02:01

    Disclaimer: I don't know of any algorithm that does this, but I would really be interested in knowing one if it exists. This answer is a naive attempt at solving the problem, with no prior knowledge whatsoever. Comments welcome, please don't laugh too loud.

    If you try doing it by hand, I would suggest applying some kind of "normalization" to your strings: lowercase them, remove punctuation, and maybe replace common abbreviations with the full words (Dr. => drive, St => street, etc.).
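
    A minimal sketch of that normalization step in Python. The ABBREVIATIONS table and the normalize function are hypothetical names made up for illustration; a real table would have to be built from your own data, and blind expansion can be wrong (e.g. "st" may also mean "saint").

```python
import re

# Hypothetical abbreviation table (an assumption, not a complete list).
ABBREVIATIONS = {"dr": "drive", "st": "street", "w": "west"}


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and expand known abbreviations."""
    text = text.lower()
    # Replace every character that is neither a word character nor
    # whitespace with a space, so "st." becomes "st ".
    text = re.sub(r"[^\w\s]", " ", text)
    # Expand abbreviations word by word; unknown words are kept as-is.
    words = [ABBREVIATIONS.get(word, word) for word in text.split()]
    return " ".join(words)


print(normalize("W. Lawnmower St."))  # west lawnmower street
```

    Expanding abbreviations before comparing means "w" and "west" no longer count as a difference at all.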

    Then you can try different alignments between the two strings you compare, and score each alignment by averaging the absolute differences between corresponding letters (e.g. a = 1, b = 2, etc., so the difference between a and b is |a - b| = 1). The smaller the average difference, the higher the correlation:

    west lawnmover drive
       w lawnmower street
    

    Thus, even if some letters differ, the correlation stays high. Then simply keep the maximal correlation you found, and decide that the two strings are the same if it is above a given threshold.
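
    The alignment idea above could be sketched roughly as follows. Several things here are my own assumptions rather than part of any known algorithm: the shorter string is slid along the longer one, non-letters (spaces, digits) all count as 0, the average difference is turned into a similarity with 1 - mean_diff / 26, and the 0.9 threshold is arbitrary. All function names are hypothetical.

```python
MAX_DIFF = 26  # largest possible difference: 'z' (26) vs. a non-letter (0)


def letter_value(ch: str) -> int:
    """Map a = 1 ... z = 26; anything else maps to 0 (an assumption)."""
    return ord(ch) - ord("a") + 1 if "a" <= ch <= "z" else 0


def best_similarity(a: str, b: str) -> float:
    """Slide the shorter string along the longer one, keep the best score."""
    short, long_ = sorted((a, b), key=len)
    if not short:
        return 1.0 if not long_ else 0.0
    best = 0.0
    for offset in range(len(long_) - len(short) + 1):
        window = long_[offset:offset + len(short)]
        diffs = [abs(letter_value(x) - letter_value(y))
                 for x, y in zip(short, window)]
        mean_diff = sum(diffs) / len(diffs)
        # Small average difference -> similarity close to 1.
        best = max(best, 1.0 - mean_diff / MAX_DIFF)
    return best


def same_record(a: str, b: str, threshold: float = 0.9) -> bool:
    """Decide that two strings match if the best score reaches the threshold."""
    return best_similarity(a, b) >= threshold


print(same_record("west lawnmover drive", "west lawnmower drive"))
```

    Note the limits of this sketch: because every digit maps to 0, different house numbers look identical, and a single inserted or deleted character shifts all the letters after it, which a fixed alignment like this cannot absorb.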
