text-extraction

How to detect Text Area from image?

前提是你 submitted on 2019-11-26 16:09:50
Question: I want to detect the text areas in an image as a preprocessing step for the Tesseract OCR engine. The engine works well when the input contains only text, but it fails when the image also contains non-text content, so I want to detect only the text content in the image. Any idea of how to do that would be helpful. Thanks.
Answer 1: Take a look at this bounding box technique demonstrated with OpenCV code: [images: input, eroded, result]
Answer 2: Well, I'm not well-experienced in image processing, but I hope I could help you with my

Text Extraction from HTML Java

混江龙づ霸主 submitted on 2019-11-26 13:59:06
Question: I'm working on a program that downloads HTML pages, selects some of the information, and writes it to another file. I want to extract the information between the paragraph tags, but I can only get one line of the paragraph. My code is as follows:
    FileReader fileReader = new FileReader(file);
    BufferedReader buffRd = new BufferedReader(fileReader);
    BufferedWriter out = new BufferedWriter(new FileWriter("newFile.txt"));
    String s;
    while ((s = buffRd.readLine()) != null) {
        if(s.contains
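
A paragraph that spans several lines cannot be matched by testing one line at a time. The following is a minimal sketch, not the asker's code: it assumes the whole page is already held in a single string and uses a DOTALL regex to capture each paragraph. The class and method names are hypothetical, and a regex like this is brittle on real-world HTML, where a proper HTML parser is the more robust choice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: collects the text between <p ...> and </p>,
// even when a paragraph is spread over several lines.
class ParagraphSketch {
    // DOTALL lets '.' match line breaks; \b keeps <pre> and <param> from matching.
    private static final Pattern PARAGRAPH = Pattern.compile(
            "<p\\b[^>]*>(.*?)</p>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    static List<String> paragraphs(String html) {
        List<String> out = new ArrayList<>();
        Matcher m = PARAGRAPH.matcher(html);
        while (m.find()) {
            // Collapse line breaks and runs of whitespace into single spaces.
            out.add(m.group(1).replaceAll("\\s+", " ").trim());
        }
        return out;
    }

    public static void main(String[] args) {
        String html = "<p>first\nline</p>\n<p class=\"x\">second</p>";
        for (String p : paragraphs(html)) {
            System.out.println(p);
        }
    }
}
```

To apply this to a downloaded file, the contents can first be read into one string (for example with `Files.readAllBytes`) instead of being processed with `readLine()`.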

Extracting text from a PDF file using PDFMiner in python?

。_饼干妹妹 submitted on 2019-11-26 12:03:22
Question: Python version 2.7. I am looking for documentation or examples on how to extract text from a PDF file using PDFMiner with Python. It looks like PDFMiner updated their API, and all the relevant examples I have found contain outdated code (classes and methods have changed). The libraries I have found that make extracting text from a PDF file easier use the old PDFMiner syntax, so I'm not sure how to do this. As it is, I'm just looking at the source code to see if I can figure it out.
DuckPuncher: Here is a working example of extracting text from a PDF file using the current version of

How to extract text from a PDF? [closed]

风格不统一 submitted on 2019-11-26 09:15:02
Question: Can anyone recommend a library/API for extracting the text and images from a PDF? We need to be able to get at text that is contained in pre-known regions of the document, so the API will need to give us positional information for each element on the page. We would like that data to be output in XML or JSON

Getting URL parameter in java and extract a specific text from that URL

被刻印的时光 ゝ submitted on 2019-11-26 07:47:07
Question: I have a URL and I need to get the value of v from it. Here is my URL: http://www.youtube.com/watch?v=_RCIP6OrQrE Any help is highly appreciated.
Answer 1: I think one of the easiest ways would be to parse the string returned by URL.getQuery() as follows:
    public static Map<String, String> getQueryMap(String query) {
        String[] params = query.split("&");
        Map<String, String> map = new HashMap<String, String>();
        for (String param : params) {
            String name = param.split("=")[0];
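
The answer above is cut off, so here is a self-contained sketch of the same split-on-`&`-and-`=` idea. It is not a reconstruction of the original answer: the class and method names are hypothetical, and the choices to use `java.net.URI`, to URL-decode names and values, and to map a parameter without `=` to an empty string are assumptions made for this illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URLDecoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: turns the query string of a URL into a name -> value map.
class QueryParamSketch {
    static Map<String, String> queryMap(String url) {
        Map<String, String> map = new LinkedHashMap<>();
        String query = URI.create(url).getRawQuery();
        if (query == null || query.isEmpty()) {
            return map; // URL has no query part
        }
        for (String param : query.split("&")) {
            if (param.isEmpty()) {
                continue; // skip empty pieces such as in "a=1&&b=2"
            }
            // Split on the first '=' only, so values may contain '='.
            int eq = param.indexOf('=');
            String name = eq < 0 ? param : param.substring(0, eq);
            String value = eq < 0 ? "" : param.substring(eq + 1);
            map.put(decode(name), decode(value));
        }
        return map;
    }

    private static String decode(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        String url = "http://www.youtube.com/watch?v=_RCIP6OrQrE";
        System.out.println(queryMap(url).get("v")); // prints _RCIP6OrQrE
    }
}
```

If a parameter name is repeated, this sketch keeps only the last value; a `Map<String, List<String>>` would be needed to keep them all.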

How to extract string following a pattern with grep, regex or perl

偶尔善良 submitted on 2019-11-26 06:57:06
Question: I have a file that looks something like this:
    <table name="content_analyzer" primary-key="id">
      <type="global" />
    </table>
    <table name="content_analyzer2" primary-key="id">
      <type="global" />
    </table>
    <table name="content_analyzer_items" primary-key="id">
      <type="global" />
    </table>
I need to extract anything within the quotes that follow name=, i.e., content_analyzer, content_analyzer2 and content_analyzer_items. I am doing this on a Linux box, so a solution using sed, perl
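
The question asks for grep, sed or perl, and the excerpt ends before any answer. Purely as an illustration of the underlying pattern, the sketch below uses Java's regex engine instead of those tools: it captures whatever sits between the double quotes after `name=`. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: collects every value of a name="..." attribute.
class NameAttributeSketch {
    // \b prevents matches inside longer attribute names such as filename="...".
    private static final Pattern NAME_ATTR =
            Pattern.compile("\\bname=\"([^\"]*)\"");

    static List<String> names(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = NAME_ATTR.matcher(text);
        while (m.find()) {
            result.add(m.group(1)); // group 1 is the text inside the quotes
        }
        return result;
    }

    public static void main(String[] args) {
        String sample =
                "<table name=\"content_analyzer\" primary-key=\"id\">\n"
              + "<table name=\"content_analyzer2\" primary-key=\"id\">\n"
              + "<table name=\"content_analyzer_items\" primary-key=\"id\">\n";
        System.out.println(names(sample));
    }
}
```

The same pattern, `name="([^"]*)"`, carries over to other regex engines with a capture group; only the quoting and the way the group is printed differ.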

How to extract text from MS office documents in C#

戏子无情 submitted on 2019-11-26 05:25:45
Question: I am trying to extract text (as a string) from MS Word (.doc, .docx), Excel and PowerPoint files using C#. Where can I find a free and simple .NET library to read MS Office documents? I tried to use NPOI, but I couldn't find a sample showing how to use it.
Answer 1: Using PInvokes, you can use the IFilter interface (on Windows). The IFilters for many common file types are installed with Windows (you can browse them using this tool). You can just ask the IFilter to return the text from the file. There are

regular expression to extract text from HTML

对着背影说爱祢 submitted on 2019-11-26 04:44:31
Question: I would like to extract all the text (displayed or not) from a general HTML page. I would like to remove any HTML tags, any JavaScript, and any CSS styles. Is there a regular expression (one or more) that will achieve that?
Answer 1: You can't really parse HTML with regular expressions. It's too complex. REs won't handle <![CDATA[ sections correctly at all. Further, some kinds of common HTML things like &lt;text&gt; will work in a browser as proper text, but might baffle a naive RE. You'll be happier and

Extracting text from a PDF file using PDFMiner in python?

瘦欲@ submitted on 2019-11-26 02:17:24
Question: Python version 2.7. I am looking for documentation or examples on how to extract text from a PDF file using PDFMiner with Python. It looks like PDFMiner updated their API, and all the relevant examples I have found contain outdated code (classes and methods have changed). The libraries I have found that make extracting text from a PDF file easier use the old PDFMiner syntax, so I'm not sure how to do this. As it is, I'm just looking at the source code to see if I can figure it out

How to extract a substring using regex

馋奶兔 submitted on 2019-11-26 00:35:49
Question: I have a string that has two single quotes in it, the ' character. In between the single quotes is the data I want. How can I write a regex to extract "the data i want" from the following text?
    mydata = "some string with 'the data i want' inside";
Answer 1: Assuming you want the part between single quotes, use this regular expression with a Matcher: "'(.*?)'" Example:
    String mydata = "some string with 'the data i want' inside";
    Pattern pattern = Pattern.compile("'(.*?)'");
    Matcher matcher =
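
Since the example above breaks off at the `Matcher` line, here is a complete, runnable sketch built around the same regular expression, `'(.*?)'`. The wrapping class, the method name, and the choice to return `null` when nothing is quoted are assumptions added for illustration, not part of the original answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: returns the text between the first pair of single quotes.
class QuotedTextSketch {
    // The reluctant quantifier .*? stops at the nearest closing quote.
    private static final Pattern SINGLE_QUOTED = Pattern.compile("'(.*?)'");

    static String firstQuoted(String text) {
        Matcher matcher = SINGLE_QUOTED.matcher(text);
        // find() must succeed before group(1) can be read.
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        String mydata = "some string with 'the data i want' inside";
        System.out.println(firstQuoted(mydata)); // prints: the data i want
    }
}
```

With a greedy `.*` instead of `.*?`, an input holding several quoted parts would be matched from the first quote to the last one, which is usually not what is wanted.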