text-segmentation

Split a string to a string of valid words using Dynamic Programming

二次信任 submitted on 2019-11-28 17:13:21
Question: I need to find a dynamic programming algorithm to solve this problem. I tried but couldn't figure it out. Here is the problem: You are given a string of n characters s[1...n], which you believe to be a corrupted text document in which all punctuation has vanished (so that it looks something like "itwasthebestoftimes..."). You wish to reconstruct the document using a dictionary, which is available in the form of a Boolean function dict(*) such that, for any string w, dict(w) has value 1 if w
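A minimal sketch of one dynamic programming approach to the problem above, assuming dict(w) reports whether w is a valid word (the excerpt is cut off before it says so). The names segment, is_word, and WORDS are hypothetical, and the small word set only stands in for the dictionary function.

```python
from typing import Callable, List, Optional


def segment(s: str, is_word: Callable[[str], bool]) -> Optional[List[str]]:
    """Return one split of s into valid words, or None if no split exists."""
    n = len(s)
    # back[i] holds the start index of the last word in some valid split
    # of the prefix s[:i], or None if s[:i] cannot be split at all.
    back: List[Optional[int]] = [None] * (n + 1)
    back[0] = 0  # the empty prefix is trivially splittable
    for i in range(1, n + 1):
        for j in range(i):
            if back[j] is not None and is_word(s[j:i]):
                back[i] = j
                break
    if back[n] is None:
        return None
    # Walk the back-pointers from the end to recover the words.
    words = []
    i = n
    while i > 0:
        j = back[i]
        words.append(s[j:i])
        i = j
    return words[::-1]


# Hypothetical dictionary standing in for dict(w).
WORDS = {"it", "was", "the", "best", "of", "times"}
print(segment("itwasthebestoftimes", WORDS.__contains__))
```

The table has n + 1 entries and each entry tries at most n split points, so the sketch makes O(n^2) dictionary calls.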

How to Split a Paragraph into Sentences

99封情书 submitted on 2019-11-28 14:15:26
I've been trying to use:
$string = "The Dr. is here!!! I am glad I'm in the U.S.A. for the Dr. quality is great!!!!!!";
preg_match_all('~.*?[?.!]~s', $string, $sentences);
print_r($sentences);
But it doesn't work on Dr., U.S.A., etc. Does anyone have any better suggestions?
There is no simple solution for that. You need to do some natural language processing (NLP) in your application to recognize each sentence. There is something called OpenNLP, a Java-based NLP parser tool, or the Stanford NLP parser in Ruby. You can find something like that for PHP. Here I found a set of classes for natural

How to separate words in a “sentence” with spaces?

。_饼干妹妹 submitted on 2019-11-28 13:23:02
Background: Looking to automate creating Domains in JasperServer. Domains are a "view" of data for creating ad hoc reports. The names of the columns must be presented to the user in a human-readable fashion.
Problem: There are over 2,000 possible pieces of data that the organization could theoretically want to include on a report. The data are sourced from non-human-friendly names such as: payperiodmatchcode labordistributioncodedesc dependentrelationship actionendoption actionendoptiondesc addresstype addresstypedesc historytype psaddresstype rolename bankaccountstatus

How to break up a paragraph by sentences in Python

微笑、不失礼 submitted on 2019-11-28 07:45:31
I need to parse sentences from a paragraph in Python. Is there an existing package to do this, or should I be trying to use regex here?
The nltk.tokenize module is designed for this and handles edge cases. For example:
>>> from nltk import tokenize
>>> p = "Good morning Dr. Adams. The patient is waiting for you in room number 3."
>>> tokenize.sent_tokenize(p)
['Good morning Dr. Adams.', 'The patient is waiting for you in room number 3.']
Here is how I am getting the first n sentences:
def get_first_n_sentence(text, n):
    endsentence = ".?!"
    sentences = itertools.groupby(text, lambda x: any(x
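For cases where installing nltk is not an option, here is a rough standard-library sketch of splitting sentences and taking the first n. It is not the truncated get_first_n_sentence function above and is far less robust than sent_tokenize. The names iter_sentences, first_n_sentences, and ABBREVIATIONS are hypothetical, and the abbreviation list is a tiny illustrative sample.

```python
import re
from itertools import islice

# Hypothetical abbreviation list; a realistic one would be much longer.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "ms.", "prof.", "u.s.a."}


def iter_sentences(text):
    """Yield sentences, breaking at ., ? or ! followed by whitespace or end of text."""
    start = 0
    for match in re.finditer(r"[.?!]+(?=\s|$)", text):
        candidate = text[start:match.end()]
        last_token = candidate.split()[-1].lower()
        if last_token in ABBREVIATIONS:
            continue  # the period belongs to an abbreviation, keep scanning
        yield candidate.strip()
        start = match.end()
    # Emit any trailing text that has no terminating punctuation.
    if text[start:].strip():
        yield text[start:].strip()


def first_n_sentences(text, n):
    """Return at most the first n sentences of text."""
    return list(islice(iter_sentences(text), n))


p = "Good morning Dr. Adams. The patient is waiting for you in room number 3."
print(first_n_sentences(p, 1))
```

One known limitation of this sketch: a sentence that genuinely ends with a listed abbreviation (for example "U.S.A.") is merged with the sentence that follows it.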

Is there any good open-source or freely available Chinese segmentation algorithm available? [closed]

我只是一个虾纸丫 submitted on 2019-11-28 03:12:39
As phrased in the question, I'm looking for a free and/or open-source text-segmentation algorithm for Chinese. I understand it is a very difficult task to solve, as there are many ambiguities involved. I know there's Google's API, but it is rather a black box, i.e. it exposes little information about what it is doing.
lschin
The keyword for Chinese text segmentation is 中文分词 in Chinese. Good and active open-source text-segmentation algorithms:
盘古分词 (Pan Gu Segment): C#, Snapshot
ik-analyzer: Java
ICTCLAS: C/C++, Java, C#, Demo
NlpBamboo: C, PHP, PostgreSQL
HTTPCWS:

a Regex for extracting sentence from a paragraph in python

蓝咒 submitted on 2019-11-28 01:42:04
Question: I'm trying to extract a sentence from a paragraph using regular expressions in Python. Usually the code that I'm testing extracts the sentence correctly, but in the following paragraph the sentence does not get extracted correctly. The paragraph: "But in the case of malaria infections and sepsis, dendritic cells throughout the body are concentrated on alerting the immune system, which prevents them from detecting and responding to any new infections." A new type of vaccine? The code: def

Java library that finds sentence boundaries

我的未来我决定 submitted on 2019-11-27 22:24:44
Question: Does anyone know of a Java library that handles finding sentence boundaries? I'm thinking that it would be a smart StringTokenizer implementation that knows about all of the sentence terminators that languages can use. Here's my experience with BreakIterator: Using the example here: I have the following Japanese:
今日はパソコンを買った。高性能のマックは早い!とても快適です。
In ASCII, it looks like this:
\ufeff\u4eca\u65e5\u306f\u30d1\u30bd\u30b3\u30f3\u3092\u8cb7\u3063\u305f\u3002\u9ad8\u6027\u80fd\u306e\u30de\u30c3\u30af

Python: Cut off the last word of a sentence?

狂风中的少年 submitted on 2019-11-27 20:38:34
Question: What's the best way to slice the last word from a block of text? I can think of two approaches: (1) split it into a list (by spaces), remove the last item, then rejoin the list; (2) use a regular expression to replace the last word. I'm currently taking approach #1, but I don't know how to concatenate the list...
content = content[position-1:position+249] # Content
words = string.split(content, ' ')
words = words[len[words] -1] # Cut off the last word
Any code examples are much appreciated.
Answer 1: Actually
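The first approach in the question (split, drop the last item, rejoin) can be sketched as follows, together with a shorter variant based on str.rsplit. This is only an illustration and not the truncated answer above; the function names are hypothetical.

```python
def drop_last_word_join(text: str) -> str:
    """Approach 1: split on spaces, drop the last item, rejoin with spaces."""
    words = text.split(" ")
    return " ".join(words[:-1])


def drop_last_word(text: str) -> str:
    """Variant: split once, from the right, at the last run of whitespace."""
    parts = text.rstrip().rsplit(None, 1)
    # A single word (or empty text) leaves nothing once the last word is gone.
    return parts[0] if len(parts) == 2 else ""


print(drop_last_word_join("the quick brown fox"))
print(drop_last_word("the quick brown fox"))
```

The rsplit variant avoids building the full word list, which can matter for long blocks of text.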

Python extract sentence containing word

百般思念 submitted on 2019-11-27 17:47:16
Question: I am trying to extract all the sentences containing a specified word from a text.
txt = "I like to eat apple. Me too. Let's go buy some apples."
txt = "." + txt
re.findall(r"\."+".+"+"apple"+".+"+"\.", txt)
But it is returning:
[".I like to eat apple. Me too. Let's go buy some apples."]
instead of:
[".I like to eat apple.", "Let's go buy some apples."]
Any help, please?
Answer 1:
In [3]: re.findall(r"([^.]*?apple[^.]*\.)",txt)
Out[4]: ['I like to eat apple.', " Let's go buy some apples."]
Answer 2: No
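The pattern from Answer 1 can be wrapped in a small helper, sketched here under the assumption that sentences end only with a period. The function name sentences_containing is hypothetical; re.escape guards against search words that contain regex metacharacters, and strip() removes the leading space seen in the answer's output.

```python
import re


def sentences_containing(text, word):
    """Return the period-terminated sentences of text that contain word."""
    # [^.]* cannot cross a period, so each match stays within one sentence.
    pattern = r"[^.]*?" + re.escape(word) + r"[^.]*\."
    return [match.strip() for match in re.findall(pattern, text)]


txt = "I like to eat apple. Me too. Let's go buy some apples."
print(sentences_containing(txt, "apple"))
# → ['I like to eat apple.', "Let's go buy some apples."]
```

Because the match is a plain substring test, "apple" also matches "apples", which is what the question's expected output asks for.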

How to get the first word of a sentence in PHP?

a 夏天 submitted on 2019-11-27 17:25:01
I want to extract the first word from a string variable. For example, take this input:
<?php $myvalue = 'Test me more'; ?>
The resulting output should be Test, which is the first word of the input. How can I do this?
You can use the explode function as follows:
$myvalue = 'Test me more';
$arr = explode(' ', trim($myvalue));
echo $arr[0]; // will print Test
salathe
There is a string function (strtok) which can be used to split a string into smaller strings (tokens) based on some separator(s). For the purposes of this thread, the first word (defined as anything before the first space