text-extraction

Using boilerpipe to extract non-English articles

[亡魂溺海] posted on 2019-12-04 06:01:59
I am trying to use the boilerpipe Java library to extract news articles from a set of websites. It works great for English text, but for text with special characters, such as words with accent marks (história), these characters are not extracted correctly. I think it is an encoding problem. The boilerpipe FAQ says "If you extract non-English text you might need to change some parameters" and then refers to a paper. I found no solution in that paper. My question is: are there any parameters in boilerpipe where I can specify the encoding? Is there any way to go around
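The symptom described in this question is consistent with bytes being decoded using the wrong charset. A minimal Python sketch of that effect, independent of boilerpipe: the sample word comes from the question, while the choice of UTF-8 as the page encoding and Latin-1 as the mistaken one is an assumption made purely for illustration.

```python
# Assumption: the page is served as UTF-8 but read as Latin-1 somewhere
# along the way. Neither charset is confirmed by the question.
raw = "história".encode("utf-8")   # bytes as they would arrive over the wire

wrong = raw.decode("latin-1")      # decoding with the wrong charset
right = raw.decode("utf-8")        # decoding with the charset the page uses

print(wrong)
print(right)
```

If extracted text shows this kind of two-character garbage in place of each accented letter, decoding the downloaded bytes with the page's declared charset before handing the text to the extractor is one thing worth checking.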

PDFminer: extract text with its font information

孤街醉人 posted on 2019-12-04 03:52:53
I found this question, but it uses the command line, and I do not want to call a Python script from the command line using subprocess and parse HTML files to get the font information. I want to use PDFminer as a library. I also found this question, but the answers are all about extracting plain text, without other information such as font name, font size, and so on.

    #!/usr/bin/env python
    from pdfminer.pdfparser import PDFParser
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfpage import PDFPage
    from pdfminer.pdfinterp import PDFResourceManager
    from pdfminer.pdfinterp import

Not able to read the exact text highlighted across the lines

与世无争的帅哥 posted on 2019-12-03 22:13:59
Question: I am working on reading the highlighted text from a PDF document using PDFBox. I was able to read highlighted text within a single line, for both single and multiple words. However, I could not read highlighted text that spans multiple lines. Please find the following sample code to read the highlighted text.

    PDDocument pddDocument = PDDocument.load(new File("C:\\pdf-sample.pdf"));
    List allPages = pddDocument.getDocumentCatalog().getAllPages();
    for (int i = 0; i < allPages.size(); i++) {
        int pageNum = i + 1;

Does Tesseract neglect any nontext area in a scanned document?

假如想象 posted on 2019-12-03 21:05:30
I'm using Tesseract, but I don't know whether it neglects non-text areas and targets text only. Do I have to remove non-text areas as a preprocessing step for better output?

karlphillip
Tesseract has a pretty good algorithm to detect text, but it will eventually give false-positive matches. Ideally, you would pre-process the image before submitting it to Tesseract. Some time ago I engaged in a similar task, so I suggest you take a look at the following material:
OpenCV C++/Obj-C: Detecting a sheet of paper / Square Detection
Executing cv::warpPerspective for a fake deskewing on a set of cv:

How to extract text under specific headings from a pdf?

◇◆丶佛笑我妖孽 posted on 2019-12-03 14:41:13
I want to extract text under specific headings from a PDF using Python. For example, I have a PDF with the headings Introduction, Summary, and Contents. I need to extract only the text under the heading 'Summary'. How can I do this?

This scenario is exactly what I am working on at my current company. We need to extract text lying under a heading. I'm personally using a rule-based system, i.e., using regex to identify all the numbered headings after reading the entire document line by line. Once I have the headings, I enter the name of the heading for which I want to find the corresponding paragraph. This
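The rule-based idea in the answer can be sketched roughly as follows. This is only an illustration, not the answerer's actual code: it assumes the PDF text has already been extracted into a list of lines, and the heading pattern (`HEADING_RE`), the helper names, and the default heading list are all hypothetical.

```python
import re

# Assumed heading shape: optional numbering such as "2." or "2.1",
# followed by a short capitalised title alone on its line.
HEADING_RE = re.compile(r"^\s*(?:\d+(?:\.\d+)*\.?\s+)?([A-Z][A-Za-z ]{2,40})\s*$")


def split_by_headings(lines, known_headings):
    """Group body lines under the most recent recognised heading."""
    sections = {}
    current = None
    for line in lines:
        match = HEADING_RE.match(line)
        title = match.group(1).strip().lower() if match else None
        if title in known_headings:
            # A recognised heading starts a new section.
            current = title
            sections[current] = []
        elif current is not None:
            sections[current].append(line.strip())
    # Join each section's non-empty lines into one string.
    return {name: " ".join(part for part in body if part)
            for name, body in sections.items()}


def text_under(lines, heading,
               known_headings=("introduction", "summary", "contents")):
    """Return the text found under `heading`, or "" if it is absent."""
    return split_by_headings(lines, set(known_headings)).get(heading.lower(), "")


sample = [
    "1. Introduction",
    "This report covers the project.",
    "2. Summary",
    "Results were good.",
    "More detail follows.",
    "3. Contents",
    "Chapter list.",
]
print(text_under(sample, "Summary"))
```

Restricting matches to a known list of headings keeps ordinary short lines from being mistaken for headings. Real documents usually need a pattern tuned to their own numbering and capitalisation.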

How can I read a PDF in Python? [duplicate]

这一生的挚爱 posted on 2019-12-03 00:52:54
This question already has answers here: How to extract text from a PDF file? (17 answers)

How can I read a PDF in Python? I know one way of converting it to text, but I want to read the content directly from the PDF. Can anyone explain which Python module is best for PDF extraction?

shankarj67
You can use the PyPDF2 package.

    # install PyPDF2
    pip install PyPDF2

    # importing all the required modules
    import PyPDF2

    # creating an object
    file = open('example.pdf', 'rb')

    # creating a pdf reader object
    fileReader = PyPDF2.PdfFileReader(file)

    # print the number of pages in pdf file
    print(fileReader.numPages)

What is the state of the art in HTML content extraction?

半腔热情 posted on 2019-12-03 00:30:02
Question: There's a lot of scholarly work on HTML content extraction, e.g., Gupta & Kaiser (2005) Extracting Content from Accessible Web Pages, and some signs of interest here, e.g., one, two, and three, but I'm not really clear about how well the practice of the latter reflects the ideas of the former. What is the best practice? Pointers to good (in particular, open source) implementations and good scholarly surveys of implementations would be the kind of thing I'm looking for. Postscript the first:

Extract pictures from Word and Excel with Python

寵の児 posted on 2019-12-02 22:20:35
Question: I was searching for a way to strip out pictures from these file types, and this is the solution I came up with. It iterates through a given directory structure, copies any files with the proper extension, and renames each copy to filename.zip. Then it navigates through the zip structure, extracts all picture files with the proper extension, and renames them to the original file name plus a number for uniqueness. Finally, it deletes the extracted directory trees it created. Extracting
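The approach described here relies on .docx and .xlsx files being zip archives. A minimal sketch of the same idea, not the poster's code: instead of copying and renaming to .zip, it reads the archive in place with `zipfile`. The function names, the extension list, and the `stem_N.ext` naming scheme are assumptions for illustration.

```python
import io
import zipfile
from pathlib import Path, PurePosixPath

# Assumed set of picture extensions; extend as needed.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".gif", ".bmp", ".emf", ".wmf"}


def collect_pictures(source, stem):
    """Return {new_name: image_bytes} for each picture in an Office file.

    source: path or binary file object for a .docx/.xlsx (zip) file.
    stem:   original file name without extension, used to name pictures.
    """
    pictures = {}
    with zipfile.ZipFile(source) as archive:
        members = [name for name in archive.namelist()
                   if PurePosixPath(name).suffix.lower() in IMAGE_EXTENSIONS]
        # Number the pictures so names stay unique per source file.
        for index, member in enumerate(members, start=1):
            suffix = PurePosixPath(member).suffix.lower()
            pictures[f"{stem}_{index}{suffix}"] = archive.read(member)
    return pictures


def save_pictures(office_path, out_dir):
    """Write the pictures found in office_path into out_dir."""
    office_path, out_dir = Path(office_path), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for name, data in collect_pictures(office_path, office_path.stem).items():
        target = out_dir / name
        target.write_bytes(data)
        written.append(target)
    return written


# Demo with an in-memory stand-in for a .docx file.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as fake_docx:
    fake_docx.writestr("word/document.xml", "<w:document/>")
    fake_docx.writestr("word/media/image1.png", b"PNG-bytes")
buffer.seek(0)
demo = collect_pictures(buffer, "report")
print(sorted(demo))
```

Reading members directly from the archive avoids creating and later deleting temporary directory trees. Walking a directory and calling `save_pictures` on each matching file would cover the iteration step the post describes.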

Read text (data) in an image using C# [closed]

本秂侑毒 posted on 2019-12-02 21:31:25
Question: Closed. This question needs to be more focused. It is not currently accepting answers. Closed 4 years ago.

Is there a way to read text (numbers and letters) in an image using C#? Is this possible, and what is the best way to do it? Thanks!

Answer 1: http://code.google.com/p/tesseract-ocr/ has some wrappers for use in .NET, or, simpler: http://www.codeproject.com/KB/office/modi.aspx but

What is the state of the art in HTML content extraction?

与世无争的帅哥 posted on 2019-12-02 14:07:15
There's a lot of scholarly work on HTML content extraction, e.g., Gupta & Kaiser (2005) Extracting Content from Accessible Web Pages, and some signs of interest here, e.g., one, two, and three, but I'm not really clear about how well the practice of the latter reflects the ideas of the former. What is the best practice? Pointers to good (in particular, open source) implementations and good scholarly surveys of implementations would be the kind of thing I'm looking for. Postscript the first: To be precise, the kind of survey I'm after would be a paper (published, unpublished, whatever)