Find the words in a long stream of characters. Auto-tokenize

梦谈多话 2021-02-04 09:44

How would you find the correct words in a long stream of characters?

Input:

"The revised report onthesyntactictheoriesofsequentialcontrolandstate"
        
5 answers
  •  忘掉有多难
    2021-02-04 10:30

    After doing the recursive splitting and dictionary lookup, you may want to use the mutual information of word pairs to improve the quality of the word pairs in your phrase.
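
    As a rough illustration of the recursive splitting and dictionary lookup step, here is a minimal Python sketch. The word set `WORDS` and the function name `segmentations` are made up for this example; a real dictionary would be loaded from a word list, and this is only one of several ways to do the split.

```python
from functools import lru_cache

# Hypothetical toy dictionary for illustration only; a real one would be
# loaded from a word list.
WORDS = frozenset({
    "on", "the", "syntactic", "theories", "of",
    "sequential", "control", "and", "state",
})


def segmentations(text, words=WORDS):
    """Return every way to split `text` into words from `words`."""

    @lru_cache(maxsize=None)
    def split_from(start):
        # Reaching the end of the text yields one (empty) completion.
        if start == len(text):
            return ((),)
        found = []
        for end in range(start + 1, len(text) + 1):
            head = text[start:end]
            if head in words:
                # Keep this word and recurse on the remainder.
                for rest in split_from(end):
                    found.append((head,) + rest)
        return tuple(found)

    return [list(seg) for seg in split_from(0)]


stream = "onthesyntactictheoriesofsequentialcontrolandstate"
for seg in segmentations(stream):
    print(" ".join(seg))
```

    With a realistic dictionary a string often has several valid splits (for example "theme" versus "the me"), and a dictionary alone cannot choose between them. That is where the mutual information of word pairs comes in.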

    This is essentially going through a training set and computing the mutual information (M.I.) value of each word pair, which tells you that "Albert Simpson" is less likely than "Albert Einstein" :)
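
    To make the idea concrete, here is a small Python sketch that scores adjacent word pairs by pointwise mutual information, log2(p(a, b) / (p(a) * p(b))), which is the per-pair form of mutual information. The corpus, the function name `build_pmi`, and the choice of the pointwise variant are assumptions for illustration; this is not the C++ code mentioned in this answer.

```python
import math
from collections import Counter


def build_pmi(sentences):
    """Return a function pmi(a, b) estimated from tokenized sentences."""
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        # Adjacent word pairs, in order.
        bigrams.update(zip(tokens, tokens[1:]))
    total_words = sum(unigrams.values())
    total_pairs = sum(bigrams.values())

    def pmi(a, b):
        # A pair never seen in training gets the lowest possible score.
        if bigrams[(a, b)] == 0:
            return float("-inf")
        p_ab = bigrams[(a, b)] / total_pairs
        p_a = unigrams[a] / total_words
        p_b = unigrams[b] / total_words
        return math.log2(p_ab / (p_a * p_b))

    return pmi


# Made-up toy training set; a real one would be far larger.
corpus = [
    ["albert", "einstein", "was", "a", "physicist"],
    ["albert", "einstein", "developed", "relativity"],
    ["homer", "simpson", "likes", "donuts"],
    ["albert", "camus", "was", "a", "writer"],
]
pmi = build_pmi(corpus)
print(pmi("albert", "einstein") > pmi("albert", "simpson"))
```

    One way to use such scores is to sum them over the adjacent pairs of each candidate segmentation and keep the highest-scoring candidate. Unseen pairs would need smoothing in practice so that a single missing pair does not rule out an otherwise good split.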

    You can try searching ScienceDirect for academic papers on this topic. For basic information on mutual information, see http://en.wikipedia.org/wiki/Mutual_information

    Last year I was involved in the phrase-search part of a search engine project, in which I parsed the Wikipedia dataset and ranked each word pair. I have the code in C++ and could share it with you if you can find some use for it. It parses Wikimedia and computes the mutual information for every word pair.
