What is a good strategy to group similar words?

孤城傲影 2020-12-29 11:35

Say I have a list of movie names with misspellings and small variations like this -

 "Pirates of the Caribbean: The Curse of the Black Pearl"
 "Pirates o


        
5 Answers
  •  春和景丽
    2020-12-29 12:02

    One approach is to pre-process all the strings before you compare them: convert everything to lowercase and standardize whitespace (e.g., replace any run of whitespace with a single space). If punctuation is not important to your end goal, you can remove all punctuation characters as well.
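The pre-processing steps above could be sketched roughly as follows. This is only a minimal illustration in Python: the function name `normalize` and the `strip_punctuation` flag are made up for this example, and it only handles ASCII punctuation.

```python
import re
import string

# Map every ASCII punctuation character to a space, so that words joined
# only by punctuation (e.g. "caribbean:the") do not fuse together.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def normalize(title: str, strip_punctuation: bool = True) -> str:
    """Lowercase a title, optionally drop punctuation, and collapse whitespace.

    `normalize` is a hypothetical helper name used for illustration.
    """
    text = title.lower()
    if strip_punctuation:
        text = text.translate(_PUNCT_TO_SPACE)
    # Collapse any run of whitespace (spaces, tabs, newlines) into one space
    # and trim the ends.
    return re.sub(r"\s+", " ", text).strip()
```

Running every title through a function like this before comparison means that differences in case, spacing, and punctuation no longer count as differences at all.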

    Levenshtein distance is commonly used to measure how similar two strings are; it should help you group strings that differ only by small spelling errors.
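As a rough sketch of how Levenshtein distance might be used for grouping, the Python below computes the distance with the standard dynamic-programming recurrence and then assigns each string to the first group whose representative is close enough. The greedy grouping strategy, the name `group_similar`, and the `max_ratio` threshold of 0.2 are assumptions chosen for illustration, not something the answer prescribes; a real data set may need a different threshold or a proper clustering method.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character insertions, deletions,
    and substitutions needed to turn `a` into `b`."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def group_similar(titles, max_ratio=0.2):
    """Greedily group titles whose distance to a group's first member is at
    most `max_ratio` times the longer string's length.

    Both the greedy strategy and the 0.2 default are illustrative assumptions.
    """
    groups = []
    for title in titles:
        for group in groups:
            representative = group[0]
            limit = max_ratio * max(len(representative), len(title))
            if levenshtein(representative, title) <= limit:
                group.append(title)
                break
        else:
            # No existing group was close enough, so start a new one.
            groups.append([title])
    return groups
```

Scaling the threshold by string length keeps short titles from being merged too eagerly, since one or two edits matter far more in a ten-character title than in a fifty-character one. Note that the result of a greedy pass like this depends on input order.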
