text-extraction

Not able to read the exact text highlighted across the lines

落爺英雄遲暮 submitted on 2019-12-01 01:28:58
I am trying to read highlighted text from a PDF document using PDFBox. I can read highlighted text that sits on a single line, whether it is one word or several, but I cannot read highlighted text that spans multiple lines. Here is the sample code I use to read the highlighted text:

    PDDocument pddDocument = PDDocument.load(new File("C:\\pdf-sample.pdf"));
    List allPages = pddDocument.getDocumentCatalog().getAllPages();
    for (int i = 0; i < allPages.size(); i++) {
        int pageNum = i + 1;
        PDPage page = (PDPage) allPages.get(i);
        List<PDAnnotation> la = page.getAnnotations();
        if (la.size() < 1)

Extracting readable text from HTML using Python?

谁说我不能喝 submitted on 2019-11-30 22:59:32
I know about utilities such as html2text and BeautifulSoup, but the issue is that they also extract JavaScript and add it to the text, which makes the two hard to separate.

    htmlDom = BeautifulSoup(webPage)
    htmlDom.findAll(text=True)

Alternatively,

    from stripogram import html2text
    extract = html2text(webPage)

Both of these extract all the JavaScript on the page as well, which is undesired. I just want to extract the readable text that you could copy from your browser.

If you want to avoid extracting any of the contents of script tags with BeautifulSoup,

    nonscripttags = htmlDom.findAll(lambda t: t
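For comparison, here is a minimal sketch of the same idea (collect text nodes, skip anything inside script tags) using only Python's standard-library `html.parser` rather than BeautifulSoup. The class name `VisibleTextParser` and the helper `visible_text` are hypothetical names chosen for this illustration, and skipping `<style>` as well is an added assumption.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collects text nodes, ignoring anything inside <script> or <style>."""

    SKIPPED_TAGS = {"script", "style"}  # assumption: style is unwanted too

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._chunks = []      # non-empty text fragments, in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only text that is outside skipped elements and not blank.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def visible_text(html):
    """Return the text of `html` without script or style contents."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return parser.text()
```

Calling `visible_text(webPage)` on a page source string returns the remaining text fragments joined by single spaces. It does not account for elements hidden by CSS, so it only approximates what a browser would show.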

Parsing date from text using Ruby

旧城冷巷雨未停 submitted on 2019-11-30 16:37:29
I'm trying to figure out how to extract dates from unstructured text using Ruby. For example, I'd like to parse the date out of this string: "Applications started after 12:00 A.M. Midnight (EST) February 1, 2010 will not be considered." Any suggestions?

Assuming you just want dates and not datetimes:

    require 'date'
    string = "Applications started after 12:00 A.M. Midnight (EST) February 1, 2010 will not be considered."
    r = /(January|February|March|April|May|June|July|August|September|October|November|December) (\d{1,2}), (\d{4})/
    if string[r]
      date = Date.parse(string[r])
      puts date
    end

Try
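The same month-name regex approach can be sketched in Python for readers who want to compare. This is an illustrative port, not the answerer's code: the names `MONTHS`, `DATE_RE`, and `find_dates` are hypothetical, and it handles only the "Month D, YYYY" shape used in the example string.

```python
import re
from datetime import date

MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]

# Month name, one or two day digits, a comma, then a four-digit year.
DATE_RE = re.compile(r"\b(" + "|".join(MONTHS) + r") (\d{1,2}), (\d{4})\b")


def find_dates(text):
    """Return every 'Month D, YYYY' date in `text` as a datetime.date."""
    found = []
    for month, day, year in DATE_RE.findall(text):
        try:
            found.append(date(int(year), MONTHS.index(month) + 1, int(day)))
        except ValueError:
            # Matches the pattern but is not a real date (e.g. February 31).
            continue
    return found
```

Looking the month up in a list instead of using `strptime` with `%B` keeps the result independent of the process locale.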

Scraping text from file within HTML tags

人走茶凉 submitted on 2019-11-30 08:53:04
Question: I have a file that I want to extract dates from. It is an HTML source file, so it is full of code and phrases I don't need. I need to extract every instance of a date that is wrapped in a specific HTML tag: abbr title="((this is the text I need))" data-utime=" What's the easiest way to achieve this?

Answer 1: If you're using Excel VBA, set a reference (Tools - References) to the MSHTML library (entitled Microsoft HTML Object Library in the reference menu)

    Sub ScrapeDateAbbr()
        Dim hDoc As MSHTML
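Outside of Excel, the same extraction can be sketched with Python's standard-library `html.parser`. This is a swapped-in technique, not the VBA answer: it collects the `title` attribute of every `abbr` tag that also carries a `data-utime` attribute. The names `AbbrTitleParser` and `extract_abbr_titles` are hypothetical.

```python
from html.parser import HTMLParser


class AbbrTitleParser(HTMLParser):
    """Collects the title attribute of <abbr> tags that have data-utime."""

    def __init__(self):
        super().__init__()
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag != "abbr":
            return
        attributes = dict(attrs)  # attribute names arrive lowercased
        if "title" in attributes and "data-utime" in attributes:
            self.titles.append(attributes["title"])


def extract_abbr_titles(html):
    """Return the title text of each <abbr title=... data-utime=...> tag."""
    parser = AbbrTitleParser()
    parser.feed(html)
    parser.close()
    return parser.titles
```

To use it on a file, read the file into a string first and pass that string to `extract_abbr_titles`.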

How to extract text from reasonably sane HTML?

若如初见. submitted on 2019-11-30 07:05:48
My question is sort of like this question but I have more constraints:

- I know the documents are reasonably sane
- they are very regular (they all came from the same source)
- I want about 99% of the visible text
- about 99% of what is visible at all is text (they are more or less RTF converted to HTML)
- I don't care about formatting or even paragraph breaks.

Are there any tools set up to do this, or am I better off just breaking out RegexBuddy and C#? I'm open to command line or batch processing tools as well as C/C#/D libraries.

SLaks: You need to use the HTML Agility Pack. You probably want to find

Parsing date from text using Ruby

戏子无情 submitted on 2019-11-29 23:58:12
Question: I'm trying to figure out how to extract dates from unstructured text using Ruby. For example, I'd like to parse the date out of this string: "Applications started after 12:00 A.M. Midnight (EST) February 1, 2010 will not be considered." Any suggestions?

Answer 1: Assuming you just want dates and not datetimes:

    require 'date'
    string = "Applications started after 12:00 A.M. Midnight (EST) February 1, 2010 will not be considered."
    r = /(January|February|March|April|May|June|July|August|September

Extract part of string between two different patterns

橙三吉。 submitted on 2019-11-29 09:49:00
Question: I am trying to use the stringr package to extract the part of a string that lies between two particular patterns. For example, I have:

    my.string <- "nanaqwertybaba"
    left.border <- "nana"
    right.border <- "baba"

Using the str_extract(string, pattern) function (where pattern is defined by a POSIX regular expression), I would like to get: "qwerty". Solutions from Google did not work.

Answer 1: I do not know whether and how this is possible with functions provided by stringr but you can also use base
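The underlying regex idea (escape both borders, then capture lazily between them) is language-independent. Here is a minimal sketch in Python rather than R, so it is an illustration of the pattern and not a stringr solution. The function name `extract_between` is hypothetical.

```python
import re


def extract_between(text, left, right):
    """Return the text between the first `left` and the next `right`, or None."""
    # re.escape makes the borders literal; (.*?) captures as little as possible.
    pattern = re.escape(left) + r"(.*?)" + re.escape(right)
    match = re.search(pattern, text)
    return match.group(1) if match else None


my_string = "nanaqwertybaba"
print(extract_between(my_string, "nana", "baba"))  # qwerty
```

The capture group returns only the middle part, so the borders do not need to be stripped afterwards.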

Scraping text from file within HTML tags

泄露秘密 submitted on 2019-11-29 08:51:36
I have a file that I want to extract dates from. It is an HTML source file, so it is full of code and phrases I don't need. I need to extract every instance of a date that is wrapped in a specific HTML tag: abbr title="((this is the text I need))" data-utime=" What's the easiest way to achieve this?

Dick Kusleika: If you're using Excel VBA, set a reference (Tools - References) to the MSHTML library (entitled Microsoft HTML Object Library in the reference menu)

    Sub ScrapeDateAbbr()
        Dim hDoc As MSHTML.HTMLDocument
        Dim hElem As MSHTML.HTMLGenericElement
        Dim sFile As String, lFile As Long
        Dim sHtml As

How to extract text from reasonably sane HTML?

谁都会走 submitted on 2019-11-29 07:42:42
Question: My question is sort of like this question but I have more constraints:

- I know the documents are reasonably sane
- they are very regular (they all came from the same source)
- I want about 99% of the visible text
- about 99% of what is visible at all is text (they are more or less RTF converted to HTML)
- I don't care about formatting or even paragraph breaks.

Are there any tools set up to do this, or am I better off just breaking out RegexBuddy and C#? I'm open to command line or batch processing tools

Jsoup - extracting text

一曲冷凌霜 submitted on 2019-11-29 06:53:40
I need to extract text from a node like this:

    <div>
      Some text <b>with tags</b> might go here.
      <p>Also there are paragraphs</p>
      More text can go without paragraphs<br/>
    </div>

And I need to build:

    Some text <b>with tags</b> might go here. Also there are paragraphs More text can go without paragraphs

Element.text returns all of the div's content as plain text. Element.ownText returns only what is not inside child elements. Both are wrong. Iterating through children ignores text nodes. Is there a way to iterate over the contents of an element and receive text nodes as well? E.g. Text node - Some text Node <b> -
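To make the desired traversal concrete, here is a minimal sketch in Python using the standard-library `html.parser`, which reports text nodes and tags in document order. It is not Jsoup code, only an illustration of the "visit text nodes and elements alike" idea. The names `KeepInlineParser` and `flatten`, and the choice to keep only `<b>` tags, are assumptions made for this example.

```python
from html.parser import HTMLParser


class KeepInlineParser(HTMLParser):
    """Keeps text nodes and selected inline tags; drops every other tag."""

    KEPT_TAGS = {"b"}  # assumption: only <b> should survive in the output

    def __init__(self):
        super().__init__()
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.KEPT_TAGS:
            self._parts.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if tag in self.KEPT_TAGS:
            self._parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Text nodes arrive here, including those between child elements.
        self._parts.append(data)

    def result(self):
        # Collapse runs of whitespace left behind by the dropped tags.
        return " ".join("".join(self._parts).split())


def flatten(html):
    """Return `html` with only text nodes and kept inline tags remaining."""
    parser = KeepInlineParser()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return parser.result()
```

Because the parser emits events for text and tags in the order they occur, no text node between child elements is skipped.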