English word segmentation in NLP?


Rather than tokenization, what it sounds like you really want to do is called word segmentation. It is, for example, how you make sense of asentencethathasnospaces.

I haven't gone through this entire tutorial, but it should get you started. They even give URLs as a potential use case.

http://jeremykun.com/2012/01/15/word-segmentation/
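If you want the core idea without reading the whole thing: segmentation can be framed as a dynamic program that tries every way of peeling a first word off the string, scores each candidate by word frequency, and memoizes the suffixes. Here is a minimal sketch along those lines; the counts in WORD_COUNTS are made-up toy numbers, and a real segmenter would load frequencies from a large corpus:

import math
from functools import lru_cache

# Toy unigram counts, purely illustrative; real systems load these
# from a large corpus such as the Google Web Trillion Word Corpus.
WORD_COUNTS = {'a': 500, 'as': 100, 'en': 5, 'sentence': 30, 'that': 400,
               'has': 200, 'no': 150, 'spaces': 10}
TOTAL = sum(WORD_COUNTS.values())

def log_prob(word):
    # Log-probability of one word; unseen words are penalized by length.
    count = WORD_COUNTS.get(word)
    if count:
        return math.log(count / TOTAL)
    return math.log(10.0 / (TOTAL * 10 ** len(word)))

@lru_cache(maxsize=None)
def segment(text):
    # Best (score, words) over every split into a first word + remainder.
    if not text:
        return 0.0, ()
    candidates = (
        (log_prob(text[:i]) + segment(text[i:])[0],
         (text[:i],) + segment(text[i:])[1])
        for i in range(1, len(text) + 1)
    )
    return max(candidates)

print(list(segment('asentencethathasnospaces')[1]))
# ['a', 'sentence', 'that', 'has', 'no', 'spaces']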

The Python wordsegment module can do this. It's an Apache2-licensed module for English word segmentation, written in pure Python and based on a trillion-word corpus.

It is based on code from the chapter “Natural Language Corpus Data” by Peter Norvig in the book “Beautiful Data” (Segaran and Hammerbacher, 2009).

Data files are derived from the Google Web Trillion Word Corpus, as described by Thorsten Brants and Alex Franz, and distributed by the Linguistic Data Consortium. This module contains only a subset of that data. The unigram data includes only the most common 333,000 words. Similarly, bigram data includes only the most common 250,000 phrases. Every word and phrase is lowercased with punctuation removed.

Installation is easy with pip:

$ pip install wordsegment

Simply call segment to get a list of words:

>>> import wordsegment as ws
>>> ws.segment('http://ads.goole.com/appid/heads')
['http', 'ads', 'goole', 'com', 'appid', 'heads']
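A heads-up if you try this today: newer releases of wordsegment (1.0 and later) require an explicit load() call before segmenting, roughly:

from wordsegment import load, segment

load()  # reads the unigram/bigram data files into memory
print(segment('http://ads.goole.com/appid/heads'))

The rest of this answer uses the older module-level API.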

As you noticed, the corpus doesn't rank "app id" very high, so "appid" is left as one token. That's OK: we can easily teach the segmenter by adding the phrase to the bigram_counts dictionary.

>>> ws.bigram_counts['app id'] = 10.2e6
>>> ws.segment('http://ads.goole.com/appid/heads')
['http', 'ads', 'goole', 'com', 'app', 'id', 'heads']

I chose the value 10.2e6 by doing a Google search for "app id" and noting the number of results.
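Finally, if you're on a newer wordsegment release, the bigram table is exposed as wordsegment.BIGRAMS rather than bigram_counts, and it's only populated after load(). I'd expect the same trick to look roughly like this, though I haven't tested it across versions:

import wordsegment

wordsegment.load()
# Same idea as the bigram_counts tweak above, using the newer dict name.
wordsegment.BIGRAMS['app id'] = 10.2e6
print(wordsegment.segment('http://ads.goole.com/appid/heads'))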
