text-segmentation

Splitting paragraphs into sentences with regexp and PHP

大城市里の小女人 submitted on 2019-12-18 11:55:33
Question: I'm a regexp noob trying to split paragraphs into sentences. In my language we use quite a few abbreviations (like: bl.a.) in the middle of sentences, so I have concluded that I need to look for punctuation that is followed by a single space and then a word that starts with a capital letter, like: [sentence1]...anymore. However...[sentence2] So a paragraph like: Der er en lang og bevæget forhistorie bag lov om varsling m.v. i forbindelse med afskedigelser
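The rule described in this question (punctuation, one space, then a capitalised word) can be sketched in Python rather than PHP. The character class `[A-ZÆØÅ]` and the sample text are assumptions made for illustration. An abbreviation followed by a proper noun would still be split incorrectly.

```python
import re

# Split at a single space that follows ., ! or ? and precedes a capital letter.
# The Danish capitals Æ, Ø and Å are added by assumption.
BOUNDARY = re.compile(r"(?<=[.!?]) (?=[A-ZÆØÅ])")

def split_sentences(paragraph):
    """Return the sentences of a paragraph, keeping their end punctuation."""
    return BOUNDARY.split(paragraph)

# Made-up sample text with abbreviations in the middle of sentences.
sample = ("Loven gælder m.v. i forbindelse med afskedigelser. "
          "Den blev bl.a. ændret senere. Ændringen var lille.")
for sentence in split_sentences(sample):
    print(sentence)
```

Because the abbreviations here are followed by a lowercase word, they do not trigger a split.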

Split a sentence into separate words

孤街浪徒 submitted on 2019-12-18 10:55:19
Question: I need to split a Chinese sentence into separate words. The problem with Chinese is that there are no spaces. For example, the sentence may look like: 主楼怎么走 (with spaces it would be: 主楼 怎么 走 ). At the moment I can think of one solution. I have a dictionary with Chinese words (in a database). The script will: try to find the first two characters of the sentence in the database ( 主楼 ), if 主楼 is actually a word and it's in the database the script will try to find the first three characters ( 主楼怎 ).
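A close relative of the dictionary lookup described here is forward maximum matching: at each position, take the longest dictionary word that fits and fall back to a single character. The sketch below is not the asker's script. The in-memory `dictionary` set stands in for the database, and `max_len` is an assumed upper bound on word length.

```python
def segment(sentence, dictionary, max_len=4):
    """Greedy forward maximum matching against a set of known words."""
    words = []
    i = 0
    while i < len(sentence):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + size]
            # A single character is always accepted so the loop makes progress.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words

dictionary = {"主楼", "怎么", "走"}  # hypothetical stand-in for the database
print(" ".join(segment("主楼怎么走", dictionary)))
```

Greedy matching can pick the wrong split when words overlap, so it is only a baseline.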

How to split paragraphs into sentences?

时光怂恿深爱的人放手 submitted on 2019-12-18 03:40:33
Question: Please have a look at the following. String[] sentenceHolder = titleAndBodyContainer.split("\n|\\.(?!\\d)|(?<!\\d)\\."); This is how I tried to split a paragraph into sentences. But there is a problem. My paragraph includes dates like Jan. 13, 2014 , words like U.S and numbers like 2.2 . They all get split by the above code. So basically, this code splits on many dots, whether a dot is a full stop or not. I tried String[] sentenceHolder = titleAndBodyContainer.split(".\n"); and String[
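One way to keep the three problem cases named in this question intact is to split only on newlines, or on whitespace that follows a period and precedes a capital letter. This Python sketch is an assumption about what the asker wants, not the accepted answer, and the sample text is made up around the question's own examples. An abbreviation followed by a capitalised word would still split.

```python
import re

# Break on newlines, or on whitespace that follows a period and
# precedes an uppercase ASCII letter.
SENTENCE_BREAK = re.compile(r"\n+|(?<=\.)\s+(?=[A-Z])")

def split_paragraph(text):
    """Split text into sentences, dropping empty pieces."""
    return [part.strip() for part in SENTENCE_BREAK.split(text) if part.strip()]

sample = "He arrived on Jan. 13, 2014. The U.S rate rose 2.2 percent.\nDone."
for sentence in split_paragraph(sample):
    print(sentence)
```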

Sentence segmentation using Regex

断了今生、忘了曾经 submitted on 2019-12-13 16:34:49
Question: I have a few text (SMS) messages and I want to segment them using a period ('.') as the delimiter. I am unable to handle the following types of messages. How can I segment these messages using regex in Python? Before segmentation: 'hyper count 16.8mmol/l.plz review b4 5pm.just to inform u.thank u' 'no of beds 8.please inform person in-charge.tq' After segmentation: 'hyper count 16.8mmol/l' 'plz review b4 5pm' 'just to inform u' 'thank u' 'no of beds 8' 'please inform person in-charge' 'tq' Each line is a
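For the two sample messages in this question, splitting on a period that is not followed by a digit reproduces the listed segmentation. This is a minimal sketch under that assumption. A message such as "room 8.5 beds" would need further rules.

```python
import re

def segment_sms(message):
    """Split on periods that are not followed by a digit (keeps 16.8 intact)."""
    return [part.strip() for part in re.split(r"\.(?!\d)", message) if part.strip()]

messages = [
    "hyper count 16.8mmol/l.plz review b4 5pm.just to inform u.thank u",
    "no of beds 8.please inform person in-charge.tq",
]
for message in messages:
    print(segment_sms(message))
```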

how to extract a whole sentence by a single word match in a string?

徘徊边缘 submitted on 2019-12-13 02:14:05
Question: I have a whole string (about 10k chars) and I am searching for a word (or many words) in it with regex(word).Matches(scrappedstring) . But how can I extract the whole sentence that contains that word? I was thinking of taking a substring after the searched word up to the first dot/exclamation mark/question mark/etc. But how do I take the part of the sentence before the searched word? Or maybe there's better logic? Answer 1: If your boundaries are e.g. . , ! , ? and ; ,
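The idea in the answer above (treat . ! ? ; as boundaries) can be sketched with one pattern that matches everything between two boundaries around the word. The sketch is in Python rather than the `Matches` API used in the question, and the function name is hypothetical.

```python
import re

def sentences_containing(text, word):
    """Return each run of non-boundary characters that contains the word."""
    pattern = re.compile(
        r"[^.!?;]*\b" + re.escape(word) + r"\b[^.!?;]*[.!?;]?",
        re.IGNORECASE,
    )
    return [match.group(0).strip() for match in pattern.finditer(text)]

text = "It rained all day. The cat sat on the mat! Did you see it?"
print(sentences_containing(text, "cat"))
```

The part before the word is captured by the leading `[^.!?;]*`, which cannot cross a boundary character.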

Python extracting sentence containing 2 words

流过昼夜 submitted on 2019-12-12 01:14:36
Question: I have the same problem that was discussed in the question "Python extract sentence containing word", but the difference is that I want to find 2 words in the same sentence. I need to extract sentences from a corpus that contain 2 specific words. Could anyone help me, please? Answer 1: If this is what you mean: import re txt="I like to eat apple. Me too. Let's go buy some apples." define_words = 'some apple' print re.findall(r"([^.]*?%s[^.]*\.)" % define_words,txt) Output: [" Let's go buy some
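The answer's pattern only matches the two words when they are adjacent and in that order. A sketch that accepts both words anywhere in the same sentence is to split first and filter second. The naive splitter and the function name below are assumptions, not part of the original answer.

```python
import re

def sentences_with_all(text, words):
    """Return sentences that contain every word in `words`, in any order."""
    # Naive split: whitespace that follows ., ! or ?
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return [
        sentence for sentence in sentences
        if all(re.search(r"\b" + re.escape(word) + r"\b", sentence, re.IGNORECASE)
               for word in words)
    ]

txt = "I like to eat apple. Me too. Let's go buy some apples."
print(sentences_with_all(txt, ["apples", "buy"]))
```

The `\b` anchors mean that "apple" does not match "apples", so word forms have to be listed explicitly.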

segment each character from noisy number plate

ぐ巨炮叔叔 submitted on 2019-12-11 15:54:10
Question: I am doing a project on Nepali Number Plate Detection where I have detected the number plate on the vehicle and skewed it, but the result is a noisy image of the number plate. I want to know how to segment every character out of it so it can be sent to the detection part. I tried doing this, but it only segmented the characters from the second line. def segment(image): H = 100. height, width, depth = image.shape imgScale = H/height newX,newY = image.shape[1]*imgScale, image.shape[0]
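One common way to avoid losing a text line on a two-line plate is a projection profile: find the row bands that contain ink, then find the column runs inside each band. The sketch below is a different technique from the truncated code in the question, and it assumes the plate is already binarised into a list of rows with 1 for ink and 0 for background. It uses plain Python lists instead of an image library.

```python
def runs(profile):
    """Return (start, end) pairs for each stretch of non-zero entries."""
    spans, start = [], None
    for index, value in enumerate(profile):
        if value and start is None:
            start = index
        elif not value and start is not None:
            spans.append((start, index))
            start = None
    if start is not None:
        spans.append((start, len(profile)))
    return spans

def segment_characters(binary):
    """Return (top, bottom, left, right) boxes, text line by text line."""
    boxes = []
    row_profile = [sum(row) for row in binary]
    for top, bottom in runs(row_profile):
        band = binary[top:bottom]
        # Sum each column of the band to find gaps between characters.
        column_profile = [sum(column) for column in zip(*band)]
        for left, right in runs(column_profile):
            boxes.append((top, bottom, left, right))
    return boxes

plate = [
    [1, 0, 1, 1],
    [1, 0, 1, 1],
    [0, 0, 0, 0],
    [0, 1, 0, 1],
]
print(segment_characters(plate))
```

On a noisy image the profiles would need a threshold rather than a strict zero test.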

Splitting HTML Content Into Sentences, But Keeping Subtags Intact

对着背影说爱祢 submitted on 2019-12-10 23:16:38
Question: I'm using the code below to separate all text within a paragraph tag into sentences. It works okay, with a few exceptions. However, tags within paragraphs are chewed up and spit out. Example: <p>This is a sample of a <a href="#">link</a> getting chewed up.</p> So, how can I ignore tags such that I could just parse sentences and place span tags around them and keep , , etc...tags in place? Or is it smarter to somehow walk the DOM and do it that way? // Split text on page into clickable
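One way to leave inline tags alone is to tokenise the paragraph's inner HTML into tags and text, look for sentence ends only in the text tokens, and copy tag tokens through unchanged. The sketch below is in Python rather than the page's JavaScript, and the `sentence` class name is hypothetical. A tag that opens in one sentence and closes in the next would produce badly nested spans, so walking the DOM remains the more robust route.

```python
import re

def wrap_sentences(inner_html):
    """Wrap each sentence in a span while passing tags through untouched."""
    tokens = re.split(r"(<[^>]+>)", inner_html)  # keep tags as separate tokens
    sentences, current = [], ""
    for token in tokens:
        if token.startswith("<"):
            current += token  # tags never end a sentence
            continue
        parts = re.split(r"(?<=[.!?])\s+", token)
        for index, part in enumerate(parts):
            current += part
            if index < len(parts) - 1:  # a sentence boundary was crossed
                sentences.append(current)
                current = ""
    if current.strip():
        sentences.append(current)
    return " ".join(f'<span class="sentence">{s}</span>' for s in sentences)

html = 'This is a sample of a <a href="#">link</a> getting chewed up. Second one here.'
print(wrap_sentences(html))
```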

extract a sentence using python

孤街浪徒 submitted on 2019-12-07 15:22:50
Question: I would like to extract the exact sentence if a particular word is present in that sentence. Could anyone let me know how to do it with Python? I used concordance() but it only prints lines where the word matches. Answer 1: Just a quick reminder: sentence breaking is actually a pretty complex thing. There are exceptions to the period rule, such as "Mr." or "Dr." There is also a variety of sentence-ending punctuation marks. But there are also exceptions to the exception (if the next word is Capitalized
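The first exception mentioned in the answer (titles such as "Mr." and "Dr.") can be sketched with a small abbreviation list consulted before a token is treated as a sentence end. The list contents and function names below are assumptions, and the "exception to the exception" is not handled.

```python
import re

ABBREVIATIONS = {"mr.", "dr.", "mrs.", "ms."}  # assumed, far from complete

def split_sentences(text):
    """Split on tokens ending in . ! or ? unless the token is a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token[-1] in ".!?" and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

def sentences_with_word(text, word):
    """Return the sentences that contain `word` as a whole word."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b")
    return [s for s in split_sentences(text) if pattern.search(s)]

text = "Dr. Smith arrived late. Mr. Jones left early! Nobody noticed."
print(sentences_with_word(text, "Jones"))
```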

Sentence matching with regex

让人想犯罪 __ submitted on 2019-12-07 13:38:35
Question: I have a text that is split across many lines, with no particular format, so I decided to line.strip('\n') each line. Then I want to split the text into sentences using the sentence end marker . with these rules: (1) a period . that is followed by \s (whitespace) or \S (like " ' ) and then by [A-Z] will split; (2) do not split [0-9]\.[A-Za-z] , like 1.stackoverflow real time solution . My program only solves half of rule 1: a period (.) that is followed by \s and [A-Z]. Below is the code: # -*- coding: utf-8 -*-
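One reading of the two rules in this question is: split at whitespace that follows a period (optionally with a closing quote after it) and precedes a capital letter, and leave a period with no whitespace after it alone. The sketch below follows that reading, which is an assumption because the question text is ambiguous about the quote case.

```python
import re

# Two fixed-width lookbehinds: a bare period, or a period plus a closing quote.
# The lookahead allows an opening quote before the capital letter.
BOUNDARY = re.compile(r"""(?:(?<=\.)|(?<=\.["']))\s+(?=["']?[A-Z])""")

def split_text(text):
    """Split on sentence boundaries, keeping cases like 1.stackoverflow whole."""
    return BOUNDARY.split(text)

sample = 'He said "stop." Then he left. See 1.stackoverflow real time solution. End'
for sentence in split_text(sample):
    print(sentence)
```

Python's `re` module requires fixed-width lookbehinds, which is why the optional quote is written as two separate alternatives.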