text-extraction | 易学教程

Not able to read the exact text highlighted across the lines

I am trying to read highlighted text from a PDF document using PDFBox. I can read highlighted text on a single line, whether it is one word or several. However, I cannot read highlighted text that spans multiple lines. Here is the sample code I use to read the highlighted text:

    PDDocument pddDocument = PDDocument.load(new File("C:\\pdf-sample.pdf"));
    List allPages = pddDocument.getDocumentCatalog().getAllPages();
    for (int i = 0; i < allPages.size(); i++) {
        int pageNum = i + 1;
        PDPage page = (PDPage) allPages.get(i);
        List<PDAnnotation> la = page.getAnnotations();
        if (la.size() < 1)
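
The rest of the code is cut off, so the cause cannot be confirmed from this excerpt. One possibility, stated here as an assumption: in the PDF format a highlight annotation stores its area in a QuadPoints array of eight numbers per quadrilateral, and a highlight that wraps across lines typically has one quadrilateral per line. Extracting text from only one rectangle would then miss the later lines. The sketch below is plain Python rather than PDFBox code; `quads_to_rects` and the sample coordinates are hypothetical and only illustrate splitting the array into one bounding box per line.

```python
def quads_to_rects(quad_points):
    """Split a flat QuadPoints array into one bounding box per quadrilateral.

    Each quadrilateral is eight numbers (four x, y corner pairs). Returns
    (min_x, min_y, max_x, max_y) tuples in the order the quads appear.
    """
    if len(quad_points) % 8 != 0:
        raise ValueError("QuadPoints length must be a multiple of 8")
    rects = []
    for start in range(0, len(quad_points), 8):
        quad = quad_points[start:start + 8]
        xs, ys = quad[0::2], quad[1::2]  # even indexes are x, odd are y
        rects.append((min(xs), min(ys), max(xs), max(ys)))
    return rects


# Hypothetical highlight covering two lines of text.
two_line_highlight = [
    100, 700, 300, 700, 100, 688, 300, 688,  # first line
    72, 686, 180, 686, 72, 674, 180, 674,    # second line
]
print(quads_to_rects(two_line_highlight))
```

Each resulting rectangle could then be used as a separate text-extraction region, and the per-line results joined in order.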

Extracting readable text from HTML using Python?

I know about utilities like html2text and BeautifulSoup, but they also extract JavaScript and add it to the text, which makes the two hard to separate.

    htmlDom = BeautifulSoup(webPage)
    htmlDom.findAll(text=True)

Alternatively:

    from stripogram import html2text
    extract = html2text(webPage)

Both of these also extract all the JavaScript on the page, which is undesired. I just want the readable text that you could copy from your browser.

If you want to avoid extracting any of the contents of script tags with BeautifulSoup:

    nonscripttags = htmlDom.findAll(lambda t: t
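
Since the BeautifulSoup answer is cut off, here is a minimal sketch of the same idea using only Python's standard-library `html.parser` instead (a swapped-in technique, not the answer's code). The names `VisibleTextParser` and `visible_text` are hypothetical. It collects text nodes and ignores anything inside `<script>` or `<style>`.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collects text nodes, ignoring anything inside <script> or <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def visible_text(html):
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = (
    "<html><head><script>var x = 1;</script></head>"
    "<body><p>Hello <b>world</b></p><script>alert('hi');</script></body></html>"
)
print(visible_text(sample))  # prints "Hello world"
```

This does not account for elements hidden by CSS, so it only approximates what a browser would let you copy.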

Parsing date from text using Ruby

I'm trying to figure out how to extract dates from unstructured text using Ruby. For example, I'd like to parse the date out of this string:

    "Applications started after 12:00 A.M. Midnight (EST) February 1, 2010 will not be considered."

Any suggestions?

Assuming you just want dates and not datetimes:

    require 'date'
    string = "Applications started after 12:00 A.M. Midnight (EST) February 1, 2010 will not be considered."
    r = /(January|February|March|April|May|June|July|August|September|October|November|December) (\d{1,2}), (\d{4})/
    if string[r]
      date = Date.parse(string[r])
      puts date
    end

Try
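
For comparison, the month-name regex approach from the answer above can be sketched in Python. This is an illustration under stated assumptions, not a translation of the author's code: `find_dates` is a hypothetical helper, and `%B` matches English month names only in an English or default C locale.

```python
import re
from datetime import date, datetime

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
# Matches dates written like "February 1, 2010".
DATE_PATTERN = re.compile(r"\b(?:%s) \d{1,2}, \d{4}\b" % MONTHS)


def find_dates(text):
    """Return every 'Month D, YYYY' date found in text as a date object."""
    return [datetime.strptime(m.group(0), "%B %d, %Y").date()
            for m in DATE_PATTERN.finditer(text)]


text = ("Applications started after 12:00 A.M. Midnight (EST) "
        "February 1, 2010 will not be considered.")
print(find_dates(text))  # prints [datetime.date(2010, 2, 1)]
```

A string that matches the pattern but is not a real date, such as "February 31, 2010", raises `ValueError` in this sketch; callers may want to catch it.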

How to extract text from reasonably sane HTML?

My question is sort of like this question, but I have more constraints:

- I know the documents are reasonably sane
- they are very regular (they all came from the same source)
- I want about 99% of the visible text
- about 99% of what is viable at all is text (they are more or less RTF converted to HTML)
- I don't care about formatting or even paragraph breaks

Are there any tools set up to do this, or am I better off just breaking out RegexBuddy and C#? I'm open to command line or batch processing tools as well as C/C#/D libraries.

SLaks

You need to use the HTML Agility Pack. You probably want to find

Extract part of string between two different patterns

Question: I am trying to use the stringr package to extract the part of a string that lies between two particular patterns. For example, I have:

    my.string <- "nanaqwertybaba"
    left.border <- "nana"
    right.border <- "baba"

and by using the str_extract(string, pattern) function (where pattern is defined by a POSIX regular expression) I would like to receive:

    "qwerty"

Solutions from Google did not work.

Answer 1: I do not know whether and how this is possible with functions provided by stringr, but you can also use base
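
The answer's base R code is cut off, so the following is only a sketch of the general regex idea, written in Python rather than R. It escapes both borders and captures whatever lies between the first left border and the nearest right border that follows it. `between` is a hypothetical helper name.

```python
import re


def between(text, left, right):
    """Return the text between left and the nearest following right, or None."""
    # re.escape keeps border strings from being read as regex syntax.
    pattern = re.escape(left) + r"(.*?)" + re.escape(right)
    match = re.search(pattern, text, flags=re.DOTALL)
    return match.group(1) if match else None


print(between("nanaqwertybaba", "nana", "baba"))  # prints "qwerty"
```

The non-greedy `(.*?)` stops at the first right border, which matters when the right pattern occurs more than once.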

Scraping text from file within HTML tags

I have a file that I want to extract dates from. It's an HTML source file, so it's full of code and phrases I don't need. I need to extract every instance of a date that's wrapped in a specific HTML tag:

    abbr title="((this is the text I need))" data-utime="

What's the easiest way to achieve this?

Dick Kusleika

If you're using Excel VBA, set a reference (Tools - References) to the MSHTML library (entitled Microsoft HTML Object Library in the reference menu).

    Sub ScrapeDateAbbr()
        Dim hDoc As MSHTML.HTMLDocument
        Dim hElem As MSHTML.HTMLGenericElement
        Dim sFile As String, lFile As Long
        Dim sHtml As
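
The VBA answer is cut off here. As an alternative illustration (not the answer's method), the same extraction can be sketched with Python's standard-library `html.parser`: collect the `title` attribute of every `<abbr>` element that also carries `data-utime`. The class and function names and the sample markup are hypothetical, since the real file is not shown.

```python
from html.parser import HTMLParser


class AbbrTitleParser(HTMLParser):
    """Collects the title attribute of every <abbr> that also has data-utime."""

    def __init__(self):
        super().__init__()
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag != "abbr":
            return
        attributes = dict(attrs)
        title = attributes.get("title")
        if title is not None and "data-utime" in attributes:
            self.titles.append(title)


def extract_abbr_titles(html):
    parser = AbbrTitleParser()
    parser.feed(html)
    parser.close()
    return parser.titles


# Hypothetical sample input; the real file's markup is not shown in the question.
sample = (
    '<div><abbr title="Monday, 3 June 2013 at 10:15" data-utime="1370250900">'
    '3 June</abbr> noise <abbr title="no utime here">x</abbr></div>'
)
print(extract_abbr_titles(sample))  # prints ['Monday, 3 June 2013 at 10:15']
```

To run it on a file, read the file's contents into a string and pass that string to `extract_abbr_titles`.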

Jsoup - extracting text

I need to extract text from a node like this:

    <div>
      Some text <b>with tags</b> might go here.
      <p>Also there are paragraphs</p>
      More text can go without paragraphs<br/>
    </div>

And I need to build:

    Some text <b>with tags</b> might go here.
    Also there are paragraphs
    More text can go without paragraphs

Element.text returns just all the content of the div. Element.ownText returns everything that is not inside child elements. Both are wrong. Iterating through children ignores text nodes. Is there a way to iterate over the contents of an element and receive text nodes as well? E.g.

    Text node - Some text
    Node <b> -
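
The idea of visiting text nodes and elements in document order can be sketched outside Jsoup. The example below uses Python's standard-library `html.parser`, which reports text nodes and tags as separate events; it is an illustration of the traversal, not Jsoup code. `InlineKeepingParser` and `flatten` are hypothetical names, and the choice to keep `<b>` while turning `<p>` and `<br>` into line breaks is an assumption based on the desired output in the question.

```python
from html.parser import HTMLParser


class InlineKeepingParser(HTMLParser):
    """Walks elements and text nodes in document order.

    Keeps <b> markup, turns <p> and <br> into line breaks, drops other tags.
    """

    KEPT_TAGS = {"b"}
    BREAK_TAGS = {"p", "br"}

    def __init__(self):
        super().__init__()
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.KEPT_TAGS:
            self._parts.append("<%s>" % tag)
        elif tag in self.BREAK_TAGS:
            self._parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.KEPT_TAGS:
            self._parts.append("</%s>" % tag)
        elif tag in self.BREAK_TAGS:
            self._parts.append("\n")

    def handle_data(self, data):
        # Text nodes arrive here, including those directly inside the <div>.
        self._parts.append(data)

    def result(self):
        # Collapse whitespace within each line and drop empty lines.
        lines = (" ".join(line.split()) for line in "".join(self._parts).split("\n"))
        return "\n".join(line for line in lines if line)


def flatten(html):
    parser = InlineKeepingParser()
    parser.feed(html)
    parser.close()
    return parser.result()


sample = (
    "<div> Some text <b>with tags</b> might go here. "
    "<p>Also there are paragraphs</p> "
    "More text can go without paragraphs<br/> </div>"
)
print(flatten(sample))
```

The key point is that text nodes are handled as first-class items alongside elements, which is what iterating only over child elements misses.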