Word splitting statistical approach

Submitted by 白昼怎懂夜的黑 on 2019-12-03 03:34:32

I think the slideshow by Peter Norvig and Sebastian Thrun is a good place to start. It presents real-world work done at Google.

This problem is entirely analogous to word segmentation in many Asian languages that don't explicitly encode word boundaries (e.g. Chinese, Thai). If you want background on approaches to the problem, I'd recommend searching Google Scholar for current Chinese Word Segmentation approaches.

You might start by looking at some older approaches: Sproat, Richard and Thomas Emerson. 2003. The first international Chinese word segmentation bakeoff (http://www.sighan.org/bakeoff2003/paper.pdf)

If you want a ready-made solution, I'd recommend LingPipe's tutorial (http://alias-i.com/lingpipe/demos/tutorial/chineseTokens/read-me.html). I've used it on unsegmented English text with good results. I trained the underlying character language model on a couple million words of newswire text, but I suspect that for this task you'll get reasonable performance using any corpus of relatively normal English text.

The LingPipe tutorial uses a spelling-correction system to propose candidate 'corrections', where each candidate is identical to the input except for inserted spaces. The spelling corrector is based on Levenshtein edit distance; it simply disallows substitution and transposition, and restricts allowable insertions to a single space.
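To make the idea concrete, here is a minimal Python sketch of "insert spaces, then keep the most probable candidate". It is not LingPipe's API and not its character language model: it assumes a toy word-unigram model instead, and the names COUNTS, word_logprob and segment, as well as the counts themselves, are made up for illustration. In practice the counts would come from a real corpus.

```python
import math
from functools import lru_cache

# Toy unigram counts (hypothetical data, for illustration only).
COUNTS = {"word": 50, "splitting": 5, "split": 20, "ting": 1,
          "is": 200, "fun": 30, "a": 300, "the": 500}
TOTAL = sum(COUNTS.values())


def word_logprob(word):
    """Log-probability of a word under the toy unigram model."""
    if word in COUNTS:
        return math.log(COUNTS[word] / TOTAL)
    # Assumed smoothing: unseen strings get a penalty that grows with
    # their length, so one long unknown "word" never wins by default.
    return math.log(1.0 / TOTAL) - 5.0 * len(word)


def segment(text, max_word_len=20):
    """Return the most probable way to insert spaces into text."""

    @lru_cache(maxsize=None)
    def best(i):
        # Best (score, words) for the suffix text[i:].
        if i == len(text):
            return (0.0, ())
        candidates = []
        for j in range(i + 1, min(len(text), i + max_word_len) + 1):
            tail_score, tail_words = best(j)
            candidates.append((word_logprob(text[i:j]) + tail_score,
                               (text[i:j],) + tail_words))
        return max(candidates)

    return list(best(0)[1])


print(segment("wordsplittingisfun"))
```

With the toy counts above this prints ['word', 'splitting', 'is', 'fun']. The memoised recursion considers every placement of spaces without enumerating them one by one; because it recurses once per character, very long inputs would need an iterative version.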
