text-extraction | 易学教程

How to detect Text Area from image?

Question: I want to detect the text area in an image as a preprocessing step for the Tesseract OCR engine. The engine works well when the input is text only, but it fails when the input image contains non-text content, so I want to detect only the text content in the image. Any idea of how to do that would be helpful, thanks.
Answer 1: Take a look at this bounding box technique demonstrated with OpenCV code. [Images: input, eroded, result]
Answer 2: Well, I'm not well-experienced in image processing, but I hope I could help you with my

Text Extraction from HTML Java

Question: I'm working on a program that downloads HTML pages, selects some of the information, and writes it to another file. I want to extract the information that is in between the paragraph tags, but I can only get one line of the paragraph. My code is as follows: FileReader fileReader = new FileReader(file); BufferedReader buffRd = new BufferedReader(fileReader); BufferedWriter out = new BufferedWriter(new FileWriter("newFile.txt")); String s; while ((s = buffRd.readLine()) != null) { if(s.contains
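
One likely reason only one line is captured is that the file is read and tested line by line, while a paragraph can span several lines. A minimal sketch of an alternative, assuming the whole page fits in memory and the markup is simple and well-formed: match across line breaks with a DOTALL pattern. The class and method names (ParagraphExtractor, extractParagraphs) are hypothetical, and a real HTML parser is more robust than a regular expression for arbitrary pages.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not taken from the original question or its answers.
class ParagraphExtractor {
    // DOTALL lets '.' match line breaks, so a paragraph may span several lines.
    // "\\b" after "p" keeps tags such as <pre> from matching.
    private static final Pattern PARAGRAPH = Pattern.compile(
            "<p\\b[^>]*>(.*?)</p>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Returns the trimmed content of every <p>...</p> element found in html.
    static List<String> extractParagraphs(String html) {
        List<String> paragraphs = new ArrayList<>();
        Matcher matcher = PARAGRAPH.matcher(html);
        while (matcher.find()) {
            paragraphs.add(matcher.group(1).trim());
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        String html = "<html><body>\n"
                + "<p>first line\nsecond line</p>\n"
                + "<p class=\"x\">another paragraph</p>\n"
                + "</body></html>";
        for (String paragraph : extractParagraphs(html)) {
            System.out.println(paragraph);
        }
    }
}
```

To apply this to a downloaded file, the content can be loaded in one piece first, for example with new String(Files.readAllBytes(Paths.get(path))), and then passed to extractParagraphs.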

Extracting text from a PDF file using PDFMiner in python?

Question: Python version 2.7. I am looking for documentation or examples on how to extract text from a PDF file using PDFMiner with Python. It looks like PDFMiner updated their API, and all the relevant examples I have found contain outdated code (classes and methods have changed). The libraries I have found that make the task of extracting text from a PDF file easier use the old PDFMiner syntax, so I'm not sure how to do this. As it is, I'm just looking at the source code to see if I can figure it out.
Answer (DuckPuncher): Here is a working example of extracting text from a PDF file using the current version of

How to extract text from a PDF? [closed]

Question: Can anyone recommend a library/API for extracting the text and images from a PDF? We need to be able to get at text that is contained in pre-known regions of the document, so the API will need to give us the positional information of each element on the page. We would like that data to be output in XML or JSON

Getting URL parameter in java and extract a specific text from that URL

Question: I have a URL and I need to get the value of v from this URL. Here is my URL: http://www.youtube.com/watch?v=_RCIP6OrQrE Any useful help is highly appreciated.
Answer 1: I think one of the easiest ways would be to parse the string returned by URL.getQuery() as: public static Map<String, String> getQueryMap(String query) { String[] params = query.split("&"); Map<String, String> map = new HashMap<String, String>(); for (String param : params) { String name = param.split("=")[0];
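
The answer is cut off in this excerpt, so the following is only a sketch of how the same idea could be completed, not the answerer's exact code. It splits the query string on "&" and then on the first "=". The class name QueryParams is hypothetical, and values are returned as they appear in the URL, without percent-decoding.

```java
import java.net.URI;
import java.net.URL;
import java.util.HashMap;
import java.util.Map;

// Hypothetical wrapper class; only getQueryMap's name comes from the excerpt.
class QueryParams {
    // Splits "a=1&b=2" into {a=1, b=2}. Values are NOT URL-decoded here.
    static Map<String, String> getQueryMap(String query) {
        Map<String, String> map = new HashMap<>();
        if (query == null || query.isEmpty()) {
            return map; // URL.getQuery() returns null when there is no query part
        }
        for (String param : query.split("&")) {
            String[] pair = param.split("=", 2); // split on the first '=' only
            String name = pair[0];
            String value = pair.length > 1 ? pair[1] : "";
            map.put(name, value);
        }
        return map;
    }

    public static void main(String[] args) throws Exception {
        URL url = URI.create("http://www.youtube.com/watch?v=_RCIP6OrQrE").toURL();
        System.out.println(getQueryMap(url.getQuery()).get("v")); // prints _RCIP6OrQrE
    }
}
```

If a parameter can contain escaped characters, passing each value through java.net.URLDecoder.decode is a reasonable extra step.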

How to extract string following a pattern with grep, regex or perl

Question: I have a file that looks something like this: <table name="content_analyzer" primary-key="id"> <type="global" /> </table> <table name="content_analyzer2" primary-key="id"> <type="global" /> </table> <table name="content_analyzer_items" primary-key="id"> <type="global" /> </table> I need to extract anything within the quotes that follow name=, i.e., content_analyzer, content_analyzer2 and content_analyzer_items. I am doing this on a Linux box, so a solution using sed, perl
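
The question asks for a command-line solution, and the excerpt ends before any answer. Purely as an illustration of the pattern itself, here is a sketch in Java that captures whatever sits between the double quotes following name=. The class and method names (TableNames, extractNames) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper illustrating the capture-group pattern.
class TableNames {
    // Group 1 is everything between the quotes after name=
    private static final Pattern NAME_ATTR = Pattern.compile("name=\"([^\"]*)\"");

    static List<String> extractNames(String text) {
        List<String> names = new ArrayList<>();
        Matcher matcher = NAME_ATTR.matcher(text);
        while (matcher.find()) {
            names.add(matcher.group(1));
        }
        return names;
    }

    public static void main(String[] args) {
        String text = "<table name=\"content_analyzer\" primary-key=\"id\">\n"
                + "<table name=\"content_analyzer2\" primary-key=\"id\">\n"
                + "<table name=\"content_analyzer_items\" primary-key=\"id\">\n";
        // prints [content_analyzer, content_analyzer2, content_analyzer_items]
        System.out.println(extractNames(text));
    }
}
```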

How to extract text from MS office documents in C#

Question: I was trying to extract text (a string) from MS Word (.doc, .docx), Excel and PowerPoint files using C#. Where can I find a free and simple .NET library to read MS Office documents? I tried to use NPOI, but I couldn't find a sample showing how to use it.
Answer 1: Using P/Invoke you can use the IFilter interface (on Windows). The IFilters for many common file types are installed with Windows (you can browse them using this tool). You can just ask the IFilter to return the text from the file. There are

regular expression to extract text from HTML

Question: I would like to extract all the text (displayed or not) from a general HTML page. I would like to remove any HTML tags, any JavaScript, and any CSS styles. Is there a regular expression (one or more) that will achieve that?
Answer 1: You can't really parse HTML with regular expressions. It's too complex. REs won't handle <![CDATA[ sections correctly at all. Further, some kinds of common HTML things like &lt;text&gt; will work in a browser as proper text, but might baffle a naive RE. You'll be happier and

How to extract a substring using regex

Question: I have a string that has two single quotes in it, the ' character. In between the single quotes is the data I want. How can I write a regex to extract "the data i want" from the following text? mydata = "some string with 'the data i want' inside";
Answer 1: Assuming you want the part between single quotes, use this regular expression with a Matcher: "'(.*?)'" Example: String mydata = "some string with 'the data i want' inside"; Pattern pattern = Pattern.compile("'(.*?)'"); Matcher matcher =
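
The example is truncated in this excerpt. A minimal sketch of how the Pattern and Matcher from the answer can be used end to end follows. The wrapper class and method (QuotedText, firstQuoted) are hypothetical names added for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper around the pattern given in the answer.
class QuotedText {
    // Reluctant ".*?" stops at the first closing quote instead of the last one.
    private static final Pattern SINGLE_QUOTED = Pattern.compile("'(.*?)'");

    // Returns the text between the first pair of single quotes, or null if none.
    static String firstQuoted(String text) {
        Matcher matcher = SINGLE_QUOTED.matcher(text);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        String mydata = "some string with 'the data i want' inside";
        System.out.println(firstQuoted(mydata)); // prints: the data i want
    }
}
```

Calling find() before group(1) matters: group(1) throws an IllegalStateException if no match has been attempted.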