Word splitting statistical approach

Submitted by 試著忘記壹切 on 2019-12-03 13:24:53

Question


I want to solve the word-splitting problem (parsing words out of a long string with no spaces). For example, we want to extract the words of somelongword to get [some, long, word].

We can achieve this with a dynamic-programming approach over a dictionary, but another issue we encounter is parsing ambiguity: e.g. orcore => or core vs. orc ore (we don't take phrase meaning or part of speech into account). So I'm thinking about using a statistical or ML approach.
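For illustration, here is a minimal sketch of that dictionary-driven approach (in Python for brevity; the small WORDS set is a toy placeholder for a real dictionary). Returning every valid split makes the ambiguity explicit:

    WORDS = {"or", "orc", "core", "ore", "some", "long", "word"}  # toy dictionary

    def segmentations(text):
        """Return every way to split `text` into dictionary words."""
        if not text:
            return [[]]  # one segmentation of the empty string: no words
        results = []
        for i in range(1, len(text) + 1):
            prefix = text[:i]
            if prefix in WORDS:
                # every split of the remainder extends this prefix
                results.extend([prefix] + rest for rest in segmentations(text[i:]))
        return results

    print(segmentations("somelongword"))  # [['some', 'long', 'word']]
    print(segmentations("orcore"))        # [['or', 'core'], ['orc', 'ore']] -- ambiguous

Memoizing on the remaining text keeps this efficient in practice; the hard part is choosing between the splits, which is where the statistics come in.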

I found that Naive Bayes and the Viterbi algorithm, given a training set, can be used to solve this. Can you point me to some information about applying these algorithms to the word-splitting problem?

UPD: I've implemented this method in Clojure, using some advice from Peter Norvig's code.
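For reference, a rough Python sketch of that idea (not my actual Clojure code): pick the split whose words maximize the product of unigram probabilities, via the same dynamic programming a Viterbi decoder uses. The COUNTS table is a toy stand-in; Norvig's code uses Google n-gram counts.

    import math
    from functools import lru_cache

    # Toy unigram counts; a real model would be trained on a large corpus.
    COUNTS = {"some": 1000, "long": 800, "word": 900,
              "or": 5000, "core": 300, "orc": 20, "ore": 60}
    TOTAL = sum(COUNTS.values())

    def log_pword(word):
        # Unknown words get a probability that shrinks with length,
        # so the decoder prefers known words over long gibberish.
        return math.log(COUNTS.get(word, 10.0 / 10 ** len(word)) / TOTAL)

    def score(words):
        return sum(log_pword(w) for w in words)

    @lru_cache(maxsize=None)
    def segment(text):
        """Most probable segmentation of `text`, as a tuple of words."""
        if not text:
            return ()
        candidates = [(text[:i],) + segment(text[i:])
                      for i in range(1, len(text) + 1)]
        return max(candidates, key=score)

    print(segment("somelongword"))  # ('some', 'long', 'word')
    print(segment("orcore"))        # ('or', 'core') under these toy counts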


Answer 1:


I think the slideshow by Peter Norvig and Sebastian Thrun is a good place to start. It presents real-world work done at Google.




Answer 2:


This problem is entirely analogous to word segmentation in the many Asian languages that don't explicitly encode word boundaries (e.g. Chinese, Thai). If you want background on approaches to the problem, I'd recommend searching Google Scholar for current Chinese word segmentation approaches.

You might start by looking at some older approaches: Sproat, Richard and Thomas Emerson. 2003. The First International Chinese Word Segmentation Bakeoff (http://www.sighan.org/bakeoff2003/paper.pdf).

If you want a ready-made solution, I'd recommend LingPipe's tutorial (http://alias-i.com/lingpipe/demos/tutorial/chineseTokens/read-me.html). I've used it on unsegmented English text with good results. I trained the underlying character language model on a couple million words of newswire text, but I suspect that for this task you'll get reasonable performance using any corpus of relatively normal English text.

They used a spelling-correction system to recommend candidate 'corrections' (where the candidate corrections are identical to the input but with spaces inserted). Their spelling corrector is based on Levenshtein edit distance; it simply disallows substitution and transposition, and restricts allowable insertions to a single space.
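To make that concrete, here's a toy illustration of the restricted edit model (not LingPipe's actual API): the only allowed edit is inserting one space, and candidates are ranked by a stand-in unigram model. A real system would rank with a trained character language model and apply the edit repeatedly rather than once.

    import math

    # Stand-in unigram model; LingPipe ranks candidates with a trained
    # character language model instead.
    COUNTS = {"or": 5000, "core": 300, "orc": 20, "ore": 60}
    TOTAL = sum(COUNTS.values())

    def score(candidate):
        # Log-probability of the candidate's words; unknown words get a floor.
        return sum(math.log(COUNTS.get(w, 0.01) / TOTAL)
                   for w in candidate.split())

    def space_insertions(text):
        """Every variant of `text` with exactly one space inserted."""
        return [text[:i] + " " + text[i:] for i in range(1, len(text))]

    def best_candidate(text):
        # Rank the unchanged input against all one-space insertions.
        return max([text] + space_insertions(text), key=score)

    print(best_candidate("orcore"))  # 'or core' under the toy counts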



Source: https://stackoverflow.com/questions/9665501/word-splitting-statistical-approach
