Compare 5000 strings with PHP Levenshtein

长情又很酷 2021-01-30 12:15

I have 5000, sometimes more, street address strings in an array. I'd like to compare them all with levenshtein to find similar matches. How can I do this without looping throug

8 answers
  •  半阙折子戏
    2021-01-30 12:28

    I think a better way to group similar addresses would be to:

    1. create a database with two tables - one for the addresses (with an id), one for the soundexes of the words or literal numbers in each address (with a foreign key to the addresses table)

    2. uppercase the address, replace anything other than [A-Z] or [0-9] with a space

    3. split the address on spaces, calculate the soundex of each 'word' (leaving anything made up only of digits as is) and store each result in the soundexes table with the foreign key of the address you started with

    4. for each address (with id $target) find the most similar addresses:

      SELECT similar.id, similar.address, count(*)
      FROM address similar, word cmp, word src
      WHERE src.address_id=$target
      AND src.soundex=cmp.soundex
      AND cmp.address_id=similar.id
      AND similar.id<>$target
      GROUP BY similar.id, similar.address
      ORDER BY count(*) DESC
      LIMIT $some_value;
      
    5. calculate the Levenshtein distance between your source address and the top few values returned by the query.

    (Doing this sort of operation on large arrays is often faster in a database.)
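As a rough illustration, steps 2, 3 and 5 above could be sketched in PHP along these lines. The function names `addressKeys` and `rankByLevenshtein`, and the id => address shape of the candidate array, are assumptions made for this example and not part of the original answer. Only the built-in `soundex()` and `levenshtein()` functions are relied on.

```php
<?php
// Steps 2-3 (sketch): normalise an address and turn each word into a lookup key.
// Tokens made up only of digits are kept literally; all others become soundex codes.
function addressKeys(string $address): array
{
    // Uppercase, then replace anything other than A-Z or 0-9 with a space.
    $clean = preg_replace('/[^A-Z0-9]+/', ' ', strtoupper($address));

    $keys = [];
    foreach (preg_split('/\s+/', trim($clean), -1, PREG_SPLIT_NO_EMPTY) as $token) {
        $keys[] = ctype_digit($token) ? $token : soundex($token);
    }
    return $keys; // these are the values that would go into the soundexes table
}

// Step 5 (sketch): re-rank the few candidates returned by the query.
// $candidates is assumed to map address id => address string.
function rankByLevenshtein(string $source, array $candidates): array
{
    $distances = [];
    foreach ($candidates as $id => $address) {
        // Compare case-insensitively by uppercasing both sides first.
        $distances[$id] = levenshtein(strtoupper($source), strtoupper($address));
    }
    asort($distances); // smallest distance first, ids preserved as keys
    return $distances;
}
```

For example, `addressKeys('123 Main St.')` gives `['123', 'M500', 'S300']`. Because `levenshtein()` then runs only against the handful of rows the query returns, each address is compared with a few candidates instead of all 5000.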
