text-extraction

Jsoup - extracting text

馋奶兔 posted on 2019-11-28 00:15:53
Question: I need to extract text from a node like this: <div> Some text <b>with tags</b> might go here. <p>Also there are paragraphs</p> More text can go without paragraphs<br/> </div> And I need to build: Some text <b>with tags</b> might go here. Also there are paragraphs More text can go without paragraphs Element.text returns all the content of the div. Element.ownText returns everything that is not inside child elements. Both are wrong. Iterating through children ignores text nodes. Is there a way

Advanced PDF Parsing Using Python (extracting text without tables, etc): What's the Best Library? [closed]

人盡茶涼 posted on 2019-11-27 16:44:45
I'm looking for a PDF library which will allow me to extract the text from a PDF document. I've looked at PyPDF, and this can extract the text from a PDF document very nicely. The problem with this is that if there are tables in the document, the text in the tables is extracted in-line with the rest of the document text. This can be problematic because it produces sections of text that aren't useful and look garbled (for instance, lots of numbers mashed together). I'm looking for something that's a bit more advanced. I'd like to extract the text from a PDF document, excluding any tables and

Extracting whole words

送分小仙女□ posted on 2019-11-27 14:48:39
I have a large set of real-world text that I need to pull words out of to feed into a spell checker. I'd like to extract as many meaningful words as possible without too much noise. I know there are plenty of regex ninjas around here, so hopefully someone can help me out. Currently I'm extracting all alphabetical sequences with '[a-z]+'. This is an okay approximation, but it drags a lot of rubbish out with it. Ideally I would like some regex (doesn't have to be pretty or efficient) that extracts all alphabetical sequences delimited by natural word separators (such as [/-_,.: ] etc.), and
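The idea of keeping only alphabetical sequences that are bounded by natural separators can be sketched in Python as follows. This is a minimal illustration, not a complete answer to the question: the separator set (whitespace plus `/ - _ , . :`) and the function name `extract_words` are assumptions chosen for the example.

```python
import re

# Assumed separator set: whitespace plus the punctuation listed in the question.
SEPARATORS = re.compile(r"[\s/\-_,.:]+")

def extract_words(text):
    """Return lowercase tokens that consist only of ASCII letters."""
    # Split on runs of separators, then discard any token that still
    # contains digits or other non-letter characters.
    tokens = SEPARATORS.split(text.lower())
    return [t for t in tokens if re.fullmatch(r"[a-z]+", t)]
```

Unlike a bare `[a-z]+` search, this drops mixed tokens such as `foo123` entirely instead of extracting the fragment `foo` from them.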

How to detect Text Area from image?

随声附和 posted on 2019-11-27 12:56:15
I want to detect the text area in an image as a preprocessing step for the Tesseract OCR engine. The engine works well when the input is text only, but when the input image contains non-text content it fails, so I want to detect only the text content in the image. Any idea of how to do that will be helpful, thanks. Take a look at this bounding box technique demonstrated with OpenCV code: [Images: input, eroded, result] Well, I'm not well-experienced in image processing, but I hope I could help you with my theoretical approach. In most cases, text forms parallel, horizontal rows, where the space between rows will

Extract columns of text from a pdf file using iText

☆樱花仙子☆ posted on 2019-11-27 10:12:01
Question: I need to extract text from PDF files using iText. The problem is that some PDF files contain two columns, and when I extract the text I get a text file in which the columns are merged (i.e. text from both columns appears on the same line). This is the code: public class pdf { private static String INPUTFILE = "http://www.revuemedecinetropicale.com/TAP_519-522_-_AO_07151GT_Rasoamananjara__ao.pdf" ; private static String OUTPUTFILE = "c:/new3.pdf"; public static void main(String[] args) throws

Regexp for extracting a mailto: address

百般思念 posted on 2019-11-27 08:12:53
Question: I'd like a regex which can take a block of text and find the strings matching the format: <a href="mailto:x@x.com">....</a> For every string that matches this format, it should extract the email address found after the mailto: . Any thoughts? This is needed for an internal app and not for any spammer purposes! Answer 1: If you want to match the whole thing from : $r = '`\<a([^>]+)href\=\"mailto\:([^">]+)\"([^>]*)\>(.*?)\<\/a\>`ism'; preg_match_all($r,$html, $matches, PREG_SET_ORDER); To
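For comparison, the same kind of extraction can be sketched in Python. This is an illustrative equivalent, not the PHP answer's code: the pattern, the name `extract_mailto`, and the choice to stop at `?` (so that `?subject=...` parameters are dropped) are assumptions made for the example.

```python
import re

# Finds href="mailto:..." inside an <a ...> tag; group 1 is the address.
# The character class stops at a quote, '>' or '?' so query parameters
# such as ?subject=... are not included in the result.
MAILTO = re.compile(r'<a\b[^>]*\bhref\s*=\s*"mailto:([^">?]+)', re.IGNORECASE)

def extract_mailto(html):
    """Return every address that appears in a mailto: link."""
    return MAILTO.findall(html)
```

A regular expression is adequate for a quick internal tool like the one described; for arbitrary real-world markup, an HTML parser is usually the more robust choice.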

Text Extraction from HTML Java

戏子无情 posted on 2019-11-27 08:04:36
I'm working on a program that downloads HTML pages, selects some of the information, and writes it to another file. I want to extract the information that sits between the paragraph tags, but I can only get one line of the paragraph. My code is as follows: FileReader fileReader = new FileReader(file); BufferedReader buffRd = new BufferedReader(fileReader); BufferedWriter out = new BufferedWriter(new FileWriter("newFile.txt")); String s; while ((s = buffRd.readLine()) != null) { if (s.contains("<p>")) { try { out.write(s); } catch (IOException e) { } } } I was trying to add another while loop,
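The symptom described above, getting only one line of each paragraph, is what a line-by-line `contains("<p>")` check produces when a paragraph spans several lines. One way around it is to parse the whole document and collect everything between `<p>` and `</p>`. The sketch below shows that approach in Python using only the standard library; the class and function names are hypothetical, and it is not a translation of the Java code in the question.

```python
from html.parser import HTMLParser

class ParagraphCollector(HTMLParser):
    """Collects the text content of every <p> element."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = None  # None means "not currently inside a <p>"

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()      # an unclosed <p> ends when the next one starts
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def _flush(self):
        if self._buffer is not None:
            # Collapse line breaks and repeated spaces into single spaces.
            self.paragraphs.append(" ".join("".join(self._buffer).split()))
            self._buffer = None

def extract_paragraphs(html):
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    parser._flush()  # keep a trailing paragraph that was never closed
    return parser.paragraphs
```

Because the parser sees the document as a stream of tags and text rather than as lines, a paragraph that spans several lines comes back as one string.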

C# Extract text from PDF using PdfSharp

一世执手 posted on 2019-11-27 04:24:29
Question: Is it possible to extract plain text from a PDF file with PdfSharp? I don't want to use iTextSharp because of its license. Answer 1: Took Sergio's answer and made some extension methods. I also changed the accumulation of strings into an iterator. public static class PdfSharpExtensions { public static IEnumerable<string> ExtractText(this PdfPage page) { var content = ContentReader.ReadContent(page); var text = content.ExtractText(); return text; } public static IEnumerable<string>

How to extract just plain text from .doc & .docx files? [closed]

六眼飞鱼酱① posted on 2019-11-27 02:50:51
Does anyone know of anything they can recommend to extract just the plain text from a .doc or .docx? I've found this - wondered if there were any other suggestions? If you want the pure plain text (my requirement) then all you need is unzip -p some.docx word/document.xml | sed -e 's/<[^>]\{1,\}>//g; s/[^[:print:]]\{1,\}//g' which I found at command line fu. It unzips the docx file, gets the actual document, then strips all the XML tags. Obviously all formatting is lost. LibreOffice: One option is LibreOffice/OpenOffice in headless mode (make sure all other instances of libreoffice are
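The unzip-and-strip-tags approach described above can also be sketched in Python with the standard library alone. This is a rough equivalent, not a port of the shell one-liner: the function name `docx_to_text` is hypothetical, and turning each `</w:p>` into a newline is an added assumption so that paragraphs stay on separate lines. It handles .docx only, since .doc is a different binary format.

```python
import html
import re
import zipfile

def docx_to_text(path):
    """Return the plain text of a .docx file, one paragraph per line."""
    # A .docx file is a zip archive; the main body lives in word/document.xml.
    with zipfile.ZipFile(path) as archive:
        xml = archive.read("word/document.xml").decode("utf-8")
    # Mark paragraph ends before the tags are removed.
    xml = xml.replace("</w:p>", "\n")
    # Strip every remaining XML tag, then decode entities such as &amp;.
    text = re.sub(r"<[^>]+>", "", xml)
    return html.unescape(text)
```

As with the shell version, all formatting is lost, and headers, footers, and footnotes stored in other parts of the archive are not included.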

How to extract regex matches using Vim

倾然丶 夕夏残阳落幕 posted on 2019-11-27 01:37:37
Question: Sample: case Foo: ... break; case Bar: ... break; case More: case Complex: ... break; ... I'd like to retrieve all the regex matches (the whole matching text, or even better, the part between \( and \) ) of the regex case \([^:]*\): which should give something like this (in a new file): Foo Bar More Complex ... Another use case would be extracting some parts, like image URLs, from an HTML file. Is there a simple way to grab all regex matches and put them in a buffer in Vim
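To make the expected result concrete, here is the same capture expressed outside Vim, as a small Python sketch. It does not answer the Vim question itself; it only shows what the group in `case \([^:]*\):` should yield. The function name `extract_case_labels` is hypothetical.

```python
import re

# Python spelling of the Vim pattern  case \([^:]*\):
# (in Python, a capture group is written with unescaped parentheses).
CASE_LABEL = re.compile(r"case ([^:]*):")

def extract_case_labels(source):
    """Return the captured label of every 'case X:' occurrence, in order."""
    return CASE_LABEL.findall(source)
```

Joining the returned list with newlines gives the one-label-per-line output the question asks for.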