text-segmentation

opencv - cropping handwritten lines (line segmentation)

女生的网名这么多〃 submitted on 2019-11-27 16:23:13
Question: I'm trying to build a handwriting recognition system using Python and OpenCV. Recognizing the characters is not the problem; the segmentation is. I have successfully segmented a word into single characters and segmented a single sentence into words in the required order, but I couldn't segment the different lines in the document. I tried sorting the contours (to avoid line segmentation and use only word segmentation), but it didn't work. I have used the following code to segment words
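The asker's word-segmentation code is not reproduced in this excerpt. For the line-segmentation step itself, one common approach (an assumption here, not something stated in the question) is a horizontal projection profile: count the ink pixels in every row of a binarized page and cut wherever the count drops to zero. The sketch below uses plain Python lists so it runs without OpenCV; the function name `find_text_lines` and the toy `page` are hypothetical. On a real thresholded image the same profile would be a per-row sum of the pixel array, and the approach assumes roughly horizontal lines separated by blank rows.

```python
def find_text_lines(binary_image, min_ink=1):
    """Return (top, bottom) row ranges of text lines in a binary image.

    binary_image: list of rows, each a list of 0 (background) / 1 (ink).
    The bottom index is exclusive, so rows[top:bottom] is one line.
    """
    # Horizontal projection profile: number of ink pixels in each row.
    profile = [sum(row) for row in binary_image]

    lines = []
    start = None
    for y, ink in enumerate(profile):
        if ink >= min_ink and start is None:
            start = y                      # entering a text line
        elif ink < min_ink and start is not None:
            lines.append((start, y))       # leaving it
            start = None
    if start is not None:                  # line touching the bottom edge
        lines.append((start, len(profile)))
    return lines


# Toy 8x6 "page": two bands of ink separated by blank rows.
page = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [0, 1, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
line_ranges = find_text_lines(page)
crops = [page[top:bottom] for top, bottom in line_ranges]
```

Each entry of `crops` can then be handed to a word-segmentation step, which avoids having to sort contours across the whole page. Skewed or touching lines would need deskewing or a higher `min_ink` threshold first.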

Explode a paragraph into sentences in PHP

喜你入骨 submitted on 2019-11-27 09:16:22
I have been using explode(".",$mystring) to split a paragraph into sentences. However, this doesn't cover sentences that end with other punctuation such as ! ? : ; Is there a way of using an array as a delimiter instead of a single character? Alternatively, is there another neat way of splitting on various punctuation marks? I tried explode(("." || "?" || "!"),$mystring) hopefully, but it didn't work...
codaddict: You can do: preg_split('/\.|\?|!/',$mystring); or (simpler): preg_split('/[.?!]/',$mystring);
You can use preg_split() combined with a PCRE lookahead condition to split
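For comparison, the same character-class idea carries over to Python's `re.split`. This is a minimal sketch, not the PHP answer's code: the sample `text` is made up, and the second pattern uses a lookbehind (rather than the lookahead mentioned in the answer) to keep the punctuation attached to each sentence.

```python
import re

text = "Is it late? Yes! Go home. Now"

# Split on any one of several delimiters (the punctuation is dropped).
parts = [p.strip() for p in re.split(r"[.?!]", text) if p.strip()]

# Split on whitespace that follows a delimiter (the punctuation is kept).
sentences = re.split(r"(?<=[.?!])\s+", text)
```

The first form mirrors `preg_split('/[.?!]/', ...)`; the second is usually more convenient when the sentences are displayed again afterwards.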

How to Split a Paragraph into Sentences

独自空忆成欢 submitted on 2019-11-27 08:16:20
Question: I've been trying to use: $string="The Dr. is here!!! I am glad I'm in the U.S.A. for the Dr. quality is great!!!!!!"; preg_match_all('~.*?[?.!]~s',$string,$sentences); print_r($sentences); But it doesn't work on Dr., U.S.A., etc. Does anyone have any better suggestions?
Answer 1: There is no simple solution for that. You need to do some natural language processing (NLP) in your application to recognize each sentence. There is something called OpenNLP, a Java-based NLP parser tool. Or Stanford
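Short of a full NLP toolkit, a rough workaround is to split on sentence-ending punctuation and then glue a piece back onto the next one when it ends in a known abbreviation. The sketch below is only an illustration of that idea: `ABBREVIATIONS` is a hypothetical, deliberately tiny list, and `split_sentences` is not taken from any of the answers.

```python
import re

# Hypothetical abbreviation list; a real one would be much longer.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "u.s.a."}


def split_sentences(text):
    # Candidate boundaries: whitespace preceded by ., ! or ?
    pieces = [p for p in re.split(r"(?<=[.!?])\s+", text.strip()) if p]
    sentences = []
    carry = ""
    for piece in pieces:
        carry = f"{carry} {piece}" if carry else piece
        last_word = carry.split()[-1].lower()
        if last_word in ABBREVIATIONS:
            continue  # boundary was a false positive; keep accumulating
        sentences.append(carry)
        carry = ""
    if carry:
        sentences.append(carry)
    return sentences


result = split_sentences(
    "The Dr. is here!!! I am glad I'm in the U.S.A. for the Dr. quality is great!!!!!!"
)
```

The obvious weakness is a sentence that genuinely ends with an abbreviation ("I live in the U.S.A. It is big."), which this sketch merges with the following sentence; that ambiguity is why the answer points toward trained sentence detectors.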

How to separate words in a “sentence” with spaces?

徘徊边缘 submitted on 2019-11-27 07:38:05
Question:
Background: I am looking to automate the creation of Domains in JasperServer. Domains are a "view" of data for creating ad hoc reports. The names of the columns must be presented to the user in a human-readable fashion.
Problem: There are over 2,000 possible pieces of data that the organization could theoretically want to include on a report. The data are sourced from non-human-friendly names such as: payperiodmatchcode, labordistributioncodedesc, dependentrelationship, actionendoption

Is there any good open-source or freely available Chinese segmentation algorithm available? [closed]

折月煮酒 submitted on 2019-11-27 05:04:57
Question: As phrased in the title, I'm looking for a free and/or open-source text-segmentation algorithm for Chinese. I do understand it is a very difficult task to solve, as there are many ambiguities involved. I know there's Google's API, but it is rather a black box, i.e. there is not much information about what it is doing

How to split a string into words. Ex: “stringintowords” -> “String Into Words”?

痞子三分冷 submitted on 2019-11-27 03:04:19
What is the right way to split a string into words? (The string doesn't contain any spaces or punctuation marks.) For example: "stringintowords" -> "String Into Words". Could you please advise what algorithm should be used here?
Update: For those who think this question is just for curiosity: this algorithm could be used to camelcase domain names ("sportandfishing .com" -> "SportAndFishing .com"), and it is currently used by aboutus dot org to do this conversion dynamically.
As mentioned by many people here, this is a standard, easy dynamic programming problem: the best solution is given
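The answer's own scoring rule is cut off in this excerpt, so the sketch below swaps in a simple criterion of its own: among all ways to cover the string with dictionary words, prefer the one with the fewest words. `WORDS` is a hypothetical toy dictionary and `segment` / `title_case` are illustrative names; a real system would use a large word list, typically with word frequencies to rank competing splits.

```python
# Hypothetical toy dictionary, for illustration only.
WORDS = {"string", "into", "words", "in", "to", "sport", "and", "fishing", "fish"}


def segment(text, words=WORDS):
    """Split `text` into dictionary words, preferring the fewest words.

    Returns a list of words, or None if no segmentation exists.
    """
    n = len(text)
    max_len = max(map(len, words))
    # best[i] is the shortest word list covering text[:i], or None.
    best = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if best[start] is not None and piece in words:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[n]


def title_case(text):
    # Fall back to the input unchanged when no segmentation is found.
    parts = segment(text.lower())
    return " ".join(p.capitalize() for p in parts) if parts else text


result_a = title_case("stringintowords")
result_b = "".join(p.capitalize() for p in segment("sportandfishing"))
```

Because `best[end]` only depends on shorter prefixes, each prefix is solved once, which is what makes this a dynamic programming problem rather than an exponential search. The fewest-words rule is what keeps "into" from being split as "in" + "to".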

How to break up a paragraph by sentences in Python

强颜欢笑 submitted on 2019-11-27 01:58:04
Question: I need to parse sentences from a paragraph in Python. Is there an existing package to do this, or should I be trying to use regex here?
Answer 1: The nltk.tokenize module is designed for this and handles edge cases. For example:
    >>> from nltk import tokenize
    >>> p = "Good morning Dr. Adams. The patient is waiting for you in room number 3."
    >>> tokenize.sent_tokenize(p)
    ['Good morning Dr. Adams.', 'The patient is waiting for you in room number 3.']
Answer 2: Here is how I am getting the first n

How to capitalize first letter of first word in a sentence?

笑着哭i submitted on 2019-11-26 20:43:58
I am trying to write a function to clean up user input. I am not trying to make it perfect. I would rather have a few names and acronyms in lowercase than a full paragraph in uppercase. I think the function should use regular expressions, but I'm pretty bad with those and I need some help. If one of the following expressions is followed by a letter, I want to make that letter uppercase:
"." and ". " (followed by a space)
"!" and "! " (followed by a space)
"?" and "? " (followed by a space)
Even better, the function could add a space after ".", "!" and "?" if those are followed by a letter. How this can be
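The rules listed in the question can be sketched with two regular-expression substitutions. This is an illustration in Python rather than the asker's language, and the function name `tidy_sentences` is hypothetical: the first pass inserts the missing space after ".", "!" or "?", and the second uppercases the letter that follows.

```python
import re


def tidy_sentences(text):
    # Add a single space after ., ! or ? when a letter follows directly.
    text = re.sub(r"([.!?])(?=[A-Za-z])", r"\1 ", text)
    # Uppercase the first letter of the text and any letter after ., ! or ?.
    return re.sub(
        r"(^\s*|[.!?]\s*)([a-z])",
        lambda m: m.group(1) + m.group(2).upper(),
        text,
    )


cleaned = tidy_sentences("hello world.how are you? fine!thanks")
```

In line with the "not trying to make it perfect" goal, this will also break up things like "e.g." or "example.com", so input containing URLs or abbreviations would need extra handling.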

How to get the first word of a sentence in PHP?

半世苍凉 submitted on 2019-11-26 18:57:13
Question: I want to extract the first word from a string variable. For example, take this input: <?php $myvalue = 'Test me more'; ?> The resulting output should be Test, which is the first word of the input. How can I do this?
Answer 1: You can use the explode function as follows:
    $myvalue = 'Test me more';
    $arr = explode(' ',trim($myvalue));
    echo $arr[0]; // will print Test
Answer 2: There is a string function (strtok) which can be used to split a string into smaller strings (tokens) based on some
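For readers working in Python rather than PHP, the equivalent of the explode-and-trim approach is a one-line `str.split`. A minimal sketch, with `first_word` as a hypothetical helper name:

```python
def first_word(text):
    # split() with no separator splits on any run of whitespace and ignores
    # leading/trailing whitespace; maxsplit=1 stops after the first cut.
    parts = text.split(maxsplit=1)
    return parts[0] if parts else ""


word = first_word("  Test me more")
```

Limiting the split to one cut avoids building a list of every word when only the first is needed, which is the same motivation behind the strtok suggestion.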

php sentence boundaries detection

大兔子大兔子 submitted on 2019-11-26 17:35:15
I would like to divide a text into sentences in PHP. I'm currently using a regex, which gives ~95% accuracy, and I would like to improve on that with a better approach. I've seen NLP tools that do this in Perl, Java, and C, but didn't see anything that fits PHP. Do you know of such a tool?
An enhanced regex solution: Assuming you do care about handling abbreviations such as Mr. and Mrs., the following single-regex solution works pretty well:
    <?php // test.php Rev:20160820_1800
    $split_sentences = '%(?#!php/i split_sentences Rev:20160820_1800)
        # Split sentences on whitespace between them.
        # See: http