What is a good strategy to group similar words?

孤城傲影 2020-12-29 11:35

Say I have a list of movie names with misspellings and small variations like this -

 "Pirates of the Caribbean: The Curse of the Black Pearl"
 "Pirates o


        
5 Answers
  •  失恋的感觉
    2020-12-29 11:53

    I believe there are in fact two distinct problems.

    The first is spell correction. You can find a Python implementation here:

    http://norvig.com/spell-correct.html
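
    To give an idea of the approach, here is a minimal sketch in the spirit of that article, not a copy of it. It only looks one edit away, and WORD_COUNTS is a hypothetical toy vocabulary; in practice you would count words from a large text corpus or from your own clean titles.

```python
from collections import Counter


def edits1(word):
    """Return every string one edit away from `word`
    (one deletion, transposition, replacement, or insertion)."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def correct(word, word_counts):
    """Return `word` if it is known; otherwise the most frequent known
    word one edit away; otherwise `word` unchanged."""
    if word in word_counts:
        return word
    candidates = [w for w in edits1(word) if w in word_counts]
    return max(candidates, key=word_counts.get) if candidates else word


# Hypothetical vocabulary, built here from a single clean title.
WORD_COUNTS = Counter("pirates of the caribbean the curse of the black pearl".split())

print(correct("caribean", WORD_COUNTS))
print(correct("pirats", WORD_COUNTS))
```

    Words that are more than one edit away from anything in the vocabulary are returned unchanged, so a real version would probably also try two edits.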

    The second is more functional. Here is what I'd do after the spell correction. I would make a relation function.

    related(sentence1, sentence2) holds if and only if sentence1 and sentence2 have rare words in common. By rare, I mean words other than the very common ones (the, what, is, etc.). You can take a look at TF-IDF weighting to determine whether two documents are related through the words they share. After a bit of googling I found this:

    https://code.google.com/p/tfidf/
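
    As a rough sketch of the relation function described above (my own illustration, not the API of the linked project): compute an inverse document frequency for each word over the whole list, then call two titles related when they share at least one word whose IDF is above a threshold. The names tokenize, idf_weights, and min_idf, the threshold value, and the sample TITLES are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(title):
    """Lowercase a title and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", title.lower())


def idf_weights(titles):
    """Map each word to log(N / number of titles containing it).
    Words that occur in every title get weight 0."""
    doc_freq = Counter()
    for title in titles:
        doc_freq.update(set(tokenize(title)))
    total = len(titles)
    return {word: math.log(total / df) for word, df in doc_freq.items()}


def related(sentence1, sentence2, idf, min_idf=0.5):
    """True if the two sentences share at least one rare word.
    Words missing from `idf` are treated as not rare (weight 0)."""
    shared = set(tokenize(sentence1)) & set(tokenize(sentence2))
    return any(idf.get(word, 0.0) >= min_idf for word in shared)


# Hypothetical sample list; only the first title comes from the question.
TITLES = [
    "Pirates of the Caribbean: The Curse of the Black Pearl",
    "Pirates of the Caribbean: Curse of the Black Pearl",
    "The Lord of the Rings: The Return of the King",
    "The Lord of the Rings: Return of the King",
]
IDF = idf_weights(TITLES)

print(related(TITLES[0], TITLES[1], IDF))  # share "pirates", "caribbean", ...
print(related(TITLES[0], TITLES[2], IDF))  # share only "of" and "the"
```

    Once you have this relation, grouping is a matter of putting together all titles that are connected through it, for example with a union-find structure. The threshold needs tuning on real data.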
