text-segmentation

Text segmentation: dictionary-based word splitting [closed]

点点圈 submitted on 2019-11-30 07:31:54
Background: Split database column names into equivalent English text to seed a data dictionary. The English dictionary is created from a corpus of corporate documents, wikis, and email. The dictionary (lexicon.csv) is a CSV file with words and probabilities. Thus, the more often someone writes the word "therapist" (in email or on a wiki page), the higher the chance that "therapistname" splits into "therapist name" as opposed to something else. (The lexicon probably won't even include the word "rapist".) Source code: TextSegmenter.java @ http://pastebin.com/taXyE03L SortableValueMap.java @ http:/
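
The idea of letting word probabilities pick the split can be sketched as follows. This is a minimal illustration in Python, not the TextSegmenter.java code linked in the question; the in-memory LEXICON and its probabilities are made-up stand-ins for lexicon.csv, and the function name segment is hypothetical.

```python
import math

# Hypothetical stand-in for lexicon.csv: word -> probability (values are made up).
LEXICON = {
    "therapist": 0.004,
    "name": 0.01,
    "the": 0.05,
    "rapist": 0.00001,
}


def segment(text, lexicon=LEXICON, max_len=20):
    """Return the most probable split of text into lexicon words, or None."""
    n = len(text)
    # best[i] holds (log probability, words) for the best split of text[:i].
    best = [(0.0, [])] + [(-math.inf, None)] * n
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word in lexicon and best[start][1] is not None:
                score = best[start][0] + math.log(lexicon[word])
                if score > best[end][0]:
                    best[end] = (score, best[start][1] + [word])
    return best[n][1]


print(segment("therapistname"))
```

Summing log probabilities favours "therapist name" over "the rapist name" here because the product of the three word probabilities is far smaller than the product of the two.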

Splitting paragraphs into sentences with regexp and PHP

我的梦境 submitted on 2019-11-30 05:02:59
I'm a regexp noob and I'm trying to split paragraphs into sentences. In my language we use quite a few abbreviations (like: bl.a.) in the middle of sentences, so I have come to the conclusion that what I need to do is look for punctuation marks that are followed by a single space and then a word that starts with a capital letter, like: [sentence1]...anymore. However...[sentence2] So a paragraph like: Der er en lang og bevæget forhistorie bag lov om varsling m.v. i forbindelse med afskedigelser af større omfang. Det er ikke en bureaukratisk lovtekst blandt så mange andre. Should end in this
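
The rule described above (punctuation, one space, then a capital letter) can be sketched with lookarounds. The question asks about PHP; this sketch uses Python's re module instead, and the names SENTENCE_BOUNDARY and split_paragraph are made up for illustration. Adding Æ, Ø and Å to the capital-letter class is an assumption based on the Danish sample.

```python
import re

# Split at a single space that follows ., ! or ? and precedes a capital letter.
# The lookarounds keep the punctuation and the capital letter in the output.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?]) (?=[A-ZÆØÅ])")


def split_paragraph(paragraph):
    """Return the sentences of paragraph according to the rule above."""
    return SENTENCE_BOUNDARY.split(paragraph)


paragraph = (
    "Der er en lang og bevæget forhistorie bag lov om varsling m.v. i "
    "forbindelse med afskedigelser af større omfang. Det er ikke en "
    "bureaukratisk lovtekst blandt så mange andre."
)
for sentence in split_paragraph(paragraph):
    print(sentence)
```

The abbreviation "m.v." is left intact because it is followed by a lowercase word. An abbreviation followed by a proper noun would still be split, so this rule is a heuristic rather than a complete solution.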

Split a sentence into separate words

守給你的承諾、 submitted on 2019-11-30 00:33:54
I need to split a Chinese sentence into separate words. The problem with Chinese is that there are no spaces. For example, the sentence may look like: 主楼怎么走 (with spaces it would be: 主楼 怎么 走 ). At the moment I can think of one solution. I have a dictionary with Chinese words (in a database). The script will: try to find the first two characters of the sentence in the database ( 主楼 ); if 主楼 is actually a word and it's in the database, the script will then try to find the first three characters ( 主楼怎 ). 主楼怎 is not a word, so it's not in the database => my application now knows that 主楼 is a separate word.
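
A closely related dictionary approach is forward maximum matching, which tries the longest candidate first instead of growing the candidate one character at a time as described above. The following is a minimal sketch of that variant, assuming an in-memory set in place of the database; the function name and the max_word_len limit are hypothetical.

```python
def forward_max_match(sentence, dictionary, max_word_len=4):
    """Greedily take the longest dictionary word at each position."""
    words = []
    i = 0
    while i < len(sentence):
        longest = min(max_word_len, len(sentence) - i)
        for length in range(longest, 0, -1):
            candidate = sentence[i:i + length]
            # A single character is always accepted so the loop makes progress.
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words


# Hypothetical dictionary standing in for the database table.
dictionary = {"主楼", "怎么", "走"}
print(forward_max_match("主楼怎么走", dictionary))
```

Greedy matching is simple but can choose a wrong split when a long word overlaps the start of the next word, which is why probability-based methods are often preferred for ambiguous text.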

Split a string to a string of valid words using Dynamic Programming

╄→гoц情女王★ submitted on 2019-11-29 21:01:45
I need to find a dynamic programming algorithm to solve this problem. I tried but couldn't figure it out. Here is the problem: You are given a string of n characters s[1...n], which you believe to be a corrupted text document in which all punctuation has vanished (so that it looks something like "itwasthebestoftimes..."). You wish to reconstruct the document using a dictionary, which is available in the form of a Boolean function dict(*) such that, for any string w, dict(w) has value 1 if w is a valid word, and has value 0 otherwise. Give a dynamic programming algorithm that determines whether
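
One common way to approach this problem is a table over prefixes: a prefix is reconstructible if some shorter reconstructible prefix is followed by a valid word. The sketch below assumes the dictionary is passed as a callable is_word, standing in for dict(w) from the problem statement; the function name can_segment and the tiny word set are made up for illustration.

```python
def can_segment(s, is_word):
    """Return True if s can be split into a sequence of valid words."""
    n = len(s)
    # reachable[i] is True when s[:i] can be split into valid words.
    reachable = [False] * (n + 1)
    reachable[0] = True  # the empty prefix needs no words
    for end in range(1, n + 1):
        for start in range(end):
            if reachable[start] and is_word(s[start:end]):
                reachable[end] = True
                break
    return reachable[n]


# Hypothetical dictionary used only for this example.
words = {"it", "was", "the", "best", "of", "times"}
print(can_segment("itwasthebestoftimes", words.__contains__))
```

With n characters there are O(n^2) (start, end) pairs, so the sketch makes O(n^2) calls to the dictionary function.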

Text segmentation: dictionary-based word splitting [closed]

匆匆过客 submitted on 2019-11-29 10:06:10
Question: Background: Split database column names into equivalent English text to seed a data dictionary. The English dictionary is created from a corpus of corporate documents, wikis, and email. The dictionary (lexicon.csv) is a CSV file with words and probabilities. Thus, the more often

A regex for extracting sentences from a paragraph in Python

笑着哭i submitted on 2019-11-29 08:03:04
I'm trying to extract a sentence from a paragraph using regular expressions in Python. Usually the code that I'm testing extracts the sentence correctly, but in the following paragraph it does not. The paragraph: "But in the case of malaria infections and sepsis, dendritic cells throughout the body are concentrated on alerting the immune system, which prevents them from detecting and responding to any new infections." A new type of vaccine? The code: def splitParagraphIntoSentences(paragraph): import re sentenceEnders = re.compile('[.!?][\s]{1,2}(?=[A-Z])')
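
In the sample paragraph the period is followed by a closing quotation mark rather than by whitespace, so a pattern that expects whitespace directly after the punctuation finds no boundary there. A minimal sketch of one possible adjustment, allowing one closing quote between the punctuation and the whitespace, is shown below; the names CLOSERS, BOUNDARY and split_on_sentence_end are hypothetical, and the pattern uses lookbehinds so that punctuation stays attached to its sentence.

```python
import re

# Characters that may close a quotation right after the end punctuation.
CLOSERS = "\"'”’"

# Two alternatives because Python lookbehinds must have a fixed width:
# punctuation plus one closing quote, or punctuation alone.
BOUNDARY = re.compile(
    r"(?<=[.!?][" + CLOSERS + r"])\s{1,2}(?=[A-Z])"
    r"|(?<=[.!?])\s{1,2}(?=[A-Z])"
)


def split_on_sentence_end(paragraph):
    """Split paragraph into sentences, keeping punctuation and quotes."""
    return BOUNDARY.split(paragraph)


text = '"They cannot respond to any new infections." A new type of vaccine?'
print(split_on_sentence_end(text))
```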

Java library that finds sentence boundaries

孤者浪人 submitted on 2019-11-29 04:28:40
Does anyone know of a Java library that handles finding sentence boundaries? I'm thinking that it would be a smart StringTokenizer implementation that knows about all of the sentence terminators that languages can use. Here's my experience with BreakIterator: Using the example here: I have the following Japanese: 今日はパソコンを買った。高性能のマックは早い!とても快適です。 In ASCII, it looks like this: \ufeff\u4eca\u65e5\u306f\u30d1\u30bd\u30b3\u30f3\u3092\u8cb7\u3063\u305f\u3002\u9ad8\u6027\u80fd\u306e\u30de\u30c3\u30af\u306f\u65e9\u3044\uff01\u3068\u3066\u3082\u5feb\u9069\u3067\u3059\u3002 Here's the part of that
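
For comparison, and not as a substitute for a Java library, the boundaries in the Japanese sample can be found with a short regular expression, because the terminators there are followed directly by the next sentence without whitespace. This Python sketch assumes a small hand-picked terminator set and strips the leading byte order mark (\ufeff) visible in the escaped string; the names TERMINATORS and split_cjk_sentences are made up.

```python
import re

# Hand-picked terminators: ideographic full stop, fullwidth and ASCII ! and ?.
TERMINATORS = "。！？!?"

# One sentence = a run of non-terminators followed by any terminators.
SENTENCE = re.compile(f"[^{TERMINATORS}]+[{TERMINATORS}]*")


def split_cjk_sentences(text):
    """Return the sentences in text, each keeping its terminator."""
    return SENTENCE.findall(text.lstrip("\ufeff"))


sample = "\ufeff今日はパソコンを買った。高性能のマックは早い！とても快適です。"
for sentence in split_cjk_sentences(sample):
    print(sentence)
```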

Python extract sentence containing word

雨燕双飞 submitted on 2019-11-29 04:20:57
I am trying to extract all the sentences containing a specified word from a text. txt="I like to eat apple. Me too. Let's go buy some apples." txt = "." + txt re.findall(r"\."+".+"+"apple"+".+"+"\.", txt) but it returns: [".I like to eat apple. Me too. Let's go buy some apples."] instead of: [".I like to eat apple.", "Let's go buy some apples."] Any help, please? In [3]: re.findall(r"([^.]*?apple[^.]*\.)",txt) Out[4]: ['I like to eat apple.', " Let's go buy some apples."] No need for regex: >>> txt = "I like to eat apple. Me too. Let's go buy some apples." >>> [sentence + '.' for
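
Another way to get the desired result is to split the text into sentences first and then keep those that contain the word. The sketch below is one possible arrangement, not the code from either answer; the function name sentences_containing is hypothetical, and the substring test means "apple" also matches "apples", as in the question's expected output.

```python
import re


def sentences_containing(text, word):
    """Return the sentences of text that contain word as a substring."""
    # Split on whitespace that follows ., ! or ? so punctuation is kept.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if word in s]


txt = "I like to eat apple. Me too. Let's go buy some apples."
print(sentences_containing(txt, "apple"))
```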

How to split paragraphs into sentences?

空扰寡人 submitted on 2019-11-29 02:26:07
Please have a look at the following. String[] sentenceHolder = titleAndBodyContainer.split("\n|\\.(?!\\d)|(?<!\\d)\\."); This is how I tried to split a paragraph into sentences. But there is a problem. My paragraph includes dates like Jan. 13, 2014, words like U.S and numbers like 2.2. They all got split by the above code. So basically, this code splits on many dots, whether they are full stops or not. I tried String[] sentenceHolder = titleAndBodyContainer.split(".\n"); and String[] sentenceHolder = titleAndBodyContainer.split("\\."); as well. All failed. How can I split a paragraph into
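
One way to keep dates, abbreviations and decimals together is to split only where end punctuation is followed by whitespace, then rejoin any piece that ends in a known abbreviation. The question uses Java; this is a Python sketch of the idea under the assumption of a small hand-maintained abbreviation list. ABBREVIATIONS and split_with_abbreviations are hypothetical names, and the list would need extending for real text.

```python
import re

# Hypothetical abbreviation list; extend it for the text being processed.
ABBREVIATIONS = {"Jan.", "Feb.", "Mr.", "Dr."}


def split_with_abbreviations(text, abbreviations=ABBREVIATIONS):
    """Split text into sentences without breaking after abbreviations."""
    # "2.2" and "U.S" survive this step: their dots are not followed by whitespace.
    pieces = re.split(r"(?<=[.!?])\s+", text.strip())
    sentences = []
    buffer = ""
    for piece in pieces:
        if not piece:
            continue
        # Rejoined pieces are separated by a single space.
        buffer = f"{buffer} {piece}" if buffer else piece
        if buffer.split()[-1] in abbreviations:
            continue  # the dot belongs to an abbreviation, keep collecting
        sentences.append(buffer)
        buffer = ""
    if buffer:
        sentences.append(buffer)
    return sentences


text = "He arrived on Jan. 13, 2014. The U.S economy grew 2.2 percent. Mr. Smith agreed."
for sentence in split_with_abbreviations(text):
    print(sentence)
```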

Python: Cut off the last word of a sentence?

家住魔仙堡 submitted on 2019-11-28 20:05:16
What's the best way to slice the last word from a block of text? I can think of: (1) Split it into a list (by spaces), remove the last item, then reconcatenate the list. (2) Use a regular expression to replace the last word. I'm currently taking approach #1, but I don't know how to concatenate the list... content = content[position-1:position+249] # Content words = string.split(content, ' ') words = words[len[words] -1] # Cut off the last word Any code examples are much appreciated. Actually, you don't need to split all the words. You can split your text at the last space into two parts using rsplit.
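
Both approaches can be sketched in a few lines of Python 3. The function names are made up for illustration, and returning an empty string when the text has at most one word is a design choice of this sketch rather than something stated in the question.

```python
def drop_last_word_rsplit(text):
    """Cut off the last word by splitting once from the right."""
    parts = text.rsplit(None, 1)  # None splits on runs of whitespace
    return parts[0] if len(parts) == 2 else ""


def drop_last_word_join(text):
    """Approach #1: split into words, drop the last, join with spaces."""
    return " ".join(text.split()[:-1])


content = "the quick brown fox"
print(drop_last_word_rsplit(content))
print(drop_last_word_join(content))
```

The rsplit version leaves the whitespace inside the remaining text untouched, while the join version collapses every run of whitespace to a single space.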