text-segmentation | 易学教程

Split a string to a string of valid words using Dynamic Programming


Question: I need to find a dynamic programming algorithm to solve this problem. I tried but couldn't figure it out. Here is the problem: You are given a string of n characters s[1...n], which you believe to be a corrupted text document in which all punctuation has vanished (so that it looks something like "itwasthebestoftimes..."). You wish to reconstruct the document using a dictionary, which is available in the form of a Boolean function dict(*) such that, for any string w, dict(w) has value 1 if w
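A minimal sketch of one common dynamic-programming approach to the problem as far as it is stated above, not necessarily the answer the original thread settled on. The names `split_into_words`, `is_word`, and `vocabulary` are hypothetical; `is_word` stands in for the Boolean `dict(*)` function.

```python
from typing import Callable, List, Optional


def split_into_words(s: str, is_word: Callable[[str], bool]) -> Optional[List[str]]:
    """Return one split of s into valid words, or None if no split exists."""
    n = len(s)
    # reachable[i] is True when the prefix s[:i] can be split into valid words.
    reachable = [False] * (n + 1)
    reachable[0] = True  # the empty prefix is trivially splittable
    # prev[i] remembers where the last word of a valid split of s[:i] starts.
    prev = [-1] * (n + 1)

    for i in range(1, n + 1):
        for j in range(i):
            if reachable[j] and is_word(s[j:i]):
                reachable[i] = True
                prev[i] = j
                break

    if not reachable[n]:
        return None

    # Walk the prev pointers backwards to recover the words.
    words = []
    i = n
    while i > 0:
        j = prev[i]
        words.append(s[j:i])
        i = j
    words.reverse()
    return words


# Hypothetical stand-in for dict(*): membership in a small set.
vocabulary = {"it", "was", "the", "best", "of", "times"}
print(split_into_words("itwasthebestoftimes", vocabulary.__contains__))
```

The table has n + 1 entries and each entry looks back at up to n earlier positions, so the sketch makes O(n^2) dictionary calls.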

How to Split a Paragraph into Sentences


I've been trying to use:

$string = "The Dr. is here!!! I am glad I'm in the U.S.A. for the Dr. quality is great!!!!!!";
preg_match_all('~.*?[?.!]~s', $string, $sentences);
print_r($sentences);

But it doesn't work on Dr., U.S.A., etc. Does anyone have any better suggestions?

There is no simple solution for that. You need to do some natural language processing (NLP) in your application to recognize each sentence. There is something called OpenNLP, a Java-based NLP parser tool, or the Stanford NLP parser in Ruby. You can find something like that for PHP. Here I found a set of classes for natural
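To illustrate why abbreviations are the hard part, here is a rough sketch of an abbreviation-aware splitter, written in Python rather than PHP. It is not one of the NLP tools mentioned above; the function name `split_sentences` and the `ABBREVIATIONS` list are assumptions made for this example, and a real application would need a far larger list or a trained model.

```python
import re
from typing import List

# Hypothetical, deliberately tiny abbreviation list (lowercase, with dots).
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "ms.", "u.s.a."}


def split_sentences(text: str) -> List[str]:
    """Split on runs of . ! ? unless the run ends a known abbreviation."""
    sentences = []
    start = 0
    # A candidate boundary is a run of terminators followed by whitespace or end of text.
    for match in re.finditer(r"[.!?]+(?=\s|$)", text):
        end = match.end()
        last_token = text[start:end].split()[-1].lower()
        if last_token in ABBREVIATIONS:
            continue  # "Dr." or "U.S.A." does not end the sentence here
        sentences.append(text[start:end].strip())
        start = end
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


text = "The Dr. is here!!! I am glad I'm in the U.S.A. for the Dr. quality is great!!!!!!"
for sentence in split_sentences(text):
    print(sentence)
```

The sketch still fails when a sentence genuinely ends with an abbreviation, which is the kind of ambiguity that pushes people toward a proper NLP library.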

How to separate words in a “sentence” with spaces?


Background: Looking to automate creating Domains in JasperServer. Domains are a "view" of data for creating ad hoc reports. The names of the columns must be presented to the user in a human-readable fashion.

Problem: There are over 2,000 possible pieces of data that the organization could theoretically want to include on a report. The data are sourced from non-human-friendly names such as: payperiodmatchcode labordistributioncodedesc dependentrelationship actionendoption actionendoptiondesc addresstype addresstypedesc historytype psaddresstype rolename bankaccountstatus
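One possible way to turn such names into labels is to segment them against a word list and prefer the split with the fewest words. The sketch below assumes a small hand-made word list (`WORDS`) and hypothetical helpers `segment` and `humanize`; it is not the approach from the original thread, whose answer is not shown here.

```python
from functools import lru_cache
from typing import FrozenSet, Optional, Tuple

# Hypothetical word list; a real one would come from a dictionary file.
WORDS = frozenset({
    "pay", "period", "match", "code", "address", "type",
    "desc", "role", "name", "bank", "account", "status",
})


def segment(name: str, words: FrozenSet[str] = WORDS) -> Optional[Tuple[str, ...]]:
    """Return the split of name with the fewest known words, or None."""

    @lru_cache(maxsize=None)
    def best(start: int) -> Optional[Tuple[str, ...]]:
        if start == len(name):
            return ()
        candidates = []
        for end in range(start + 1, len(name) + 1):
            if name[start:end] in words:
                rest = best(end)
                if rest is not None:
                    candidates.append((name[start:end],) + rest)
        return min(candidates, key=len) if candidates else None

    return best(0)


def humanize(name: str) -> str:
    """Capitalize each recovered word; leave unsegmentable names unchanged."""
    parts = segment(name)
    return " ".join(part.capitalize() for part in parts) if parts else name


for column in ("payperiodmatchcode", "addresstypedesc", "bankaccountstatus", "psaddresstype"):
    print(column, "->", humanize(column))
```

Names containing fragments outside the word list, such as the `ps` prefix in `psaddresstype`, are returned unchanged so they can be reviewed by hand.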

How to break up a paragraph by sentences in Python


I need to parse sentences from a paragraph in Python. Is there an existing package to do this, or should I be trying to use regex here?

The nltk.tokenize module is designed for this and handles edge cases. For example:

>>> from nltk import tokenize
>>> p = "Good morning Dr. Adams. The patient is waiting for you in room number 3."
>>> tokenize.sent_tokenize(p)
['Good morning Dr. Adams.', 'The patient is waiting for you in room number 3.']

Here is how I am getting the first n sentences:

def get_first_n_sentence(text, n):
    endsentence = ".?!"
    sentences = itertools.groupby(text, lambda x: any(x

Is there any good open-source or freely available Chinese segmentation algorithm available? [closed]


As phrased in the question, I'm looking for a free and/or open-source text-segmentation algorithm for Chinese. I do understand it is a very difficult task to solve, as there are many ambiguities involved. I know there's Google's API, but it is rather a black box, i.e. not much information about what it is doing is exposed.

The keyword for Chinese text segmentation should be 中文分词 in Chinese. Good and active open-source text-segmentation algorithms:

盘古分词 (Pan Gu Segment): C#, Snapshot
ik-analyzer: Java
ICTCLAS: C/C++, Java, C#, Demo
NlpBamboo: C, PHP, PostgreSQL
HTTPCWS :
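For readers new to the problem, forward maximum matching is a classic dictionary-based baseline for Chinese word segmentation. The sketch below illustrates that baseline only; it makes no claim about how any of the tools listed above work internally, and `forward_max_match`, `vocabulary`, and `max_len` are names invented for this example.

```python
from typing import List, Set


def forward_max_match(text: str, vocabulary: Set[str], max_len: int = 4) -> List[str]:
    """Greedily take the longest vocabulary word at each position."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest window first and shrink it one character at a time.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a fallback token.
            if size == 1 or piece in vocabulary:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical three-word vocabulary for demonstration.
vocabulary = {"中文", "分词", "算法"}
print(forward_max_match("中文分词算法", vocabulary))
```

Greedy matching cannot resolve the ambiguities the question mentions, which is why practical segmenters usually add statistical models on top of a dictionary.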

A regex for extracting a sentence from a paragraph in Python


Question: I'm trying to extract a sentence from a paragraph using regular expressions in Python. Usually the code that I'm testing extracts the sentence correctly, but in the following paragraph the sentence does not get extracted correctly. The paragraph: "But in the case of malaria infections and sepsis, dendritic cells throughout the body are concentrated on alerting the immune system, which prevents them from detecting and responding to any new infections." A new type of vaccine? The code: def

Java library that finds sentence boundaries


Question: Does anyone know of a Java library that handles finding sentence boundaries? I'm thinking that it would be a smart StringTokenizer implementation that knows about all of the sentence terminators that languages can use. Here's my experience with BreakIterator: Using the example here: I have the following Japanese: 今日はパソコンを買った。高性能のマックは早い！とても快適です。 In ASCII, it looks like this: \ufeff\u4eca\u65e5\u306f\u30d1\u30bd\u30b3\u30f3\u3092\u8cb7\u3063\u305f\u3002\u9ad8\u6027\u80fd\u306e\u30de\u30c3\u30af

Python: Cut off the last word of a sentence?


Question: What's the best way to slice the last word from a block of text? I can think of:

1. Split it into a list (by spaces), remove the last item, then reconcatenate the list.
2. Use a regular expression to replace the last word.

I'm currently taking approach #1, but I don't know how to concatenate the list...

content = content[position-1:position+249] # Content
words = string.split(content, ' ')
words = words[len[words] -1] # Cut off the last word

Any code examples are much appreciated.

Answer 1: Actually
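Approach #1 from the question can be sketched as follows. This is an illustration of the split-and-rejoin idea, not a reconstruction of the truncated answer; `drop_last_word` is a hypothetical helper name.

```python
def drop_last_word(text: str) -> str:
    """Split on spaces, drop the final item, and join the rest back together."""
    words = text.split(" ")
    # words[:-1] keeps everything except the last element;
    # " ".join(...) is the concatenation step the question asks about.
    return " ".join(words[:-1])


print(drop_last_word("The quick brown fox"))
```

Note that the question's `words[len[words] -1]` both uses square brackets on `len` and selects the last word instead of removing it; slicing with `[:-1]` avoids both problems.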

Python extract sentence containing word


Question: I am trying to extract all the sentences containing a specified word from a text.

txt = "I like to eat apple. Me too. Let's go buy some apples."
txt = "." + txt
re.findall(r"\."+".+"+"apple"+".+"+"\.", txt)

But it is returning:

[".I like to eat apple. Me too. Let's go buy some apples."]

instead of:

[".I like to eat apple.", "Let's go buy some apples."]

Any help, please?

Answer 1:

In [3]: re.findall(r"([^.]*?apple[^.]*\.)",txt)
Out[4]: ['I like to eat apple.', " Let's go buy some apples."]

Answer 2: No
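The pattern from Answer 1 can be wrapped in a small function. This sketch adds two things that are not in the original answer: `re.escape` on the search word and stripping of the leading space; `sentences_containing` is a hypothetical name.

```python
import re
from typing import List


def sentences_containing(text: str, word: str) -> List[str]:
    """Return every period-terminated sentence that contains word."""
    # [^.]* cannot cross a period, so each match stays inside one sentence,
    # unlike the greedy ".+" in the question, which spans the whole text.
    pattern = r"[^.]*?" + re.escape(word) + r"[^.]*\."
    return [match.strip() for match in re.findall(pattern, text)]


txt = "I like to eat apple. Me too. Let's go buy some apples."
print(sentences_containing(txt, "apple"))
```

Because this is a substring match, "apple" also matches "apples"; word-boundary anchors (`\b`) would be needed to exclude that.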

How to get the first word of a sentence in PHP?


I want to extract the first word of a string variable. For example, take this input:

<?php $myvalue = 'Test me more'; ?>

The resultant output should be Test, which is the first word of the input. How can I do this?

You can use the explode function as follows:

$myvalue = 'Test me more';
$arr = explode(' ', trim($myvalue));
echo $arr[0]; // will print Test

There is a string function (strtok) which can be used to split a string into smaller strings (tokens) based on some separator(s). For the purposes of this thread, the first word (defined as anything before the first space
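For comparison with the other Python snippets on this page, the same trim-and-split idea might look like this in Python. It is an analogue of the `explode` answer, not a PHP solution, and `first_word` is a hypothetical name.

```python
def first_word(text: str) -> str:
    """Return the first whitespace-delimited word, or "" if there is none."""
    # str.split() with no arguments ignores leading and trailing whitespace
    # and splits on any run of whitespace, so no separate trim step is needed.
    parts = text.split()
    return parts[0] if parts else ""


print(first_word("Test me more"))
```

Unlike `explode(' ', ...)`, which splits only on single spaces, this version also treats tabs and newlines as separators.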