text-extraction

PDF Text Extraction with Coordinates

时光怂恿深爱的人放手 submitted on 2019-11-26 23:50:24
I would like to extract text from a portion (using coordinates) of a PDF using Ghostscript. Can anyone help me out? Kurt Pfeifle: Yes, with Ghostscript you can extract text from PDFs. But no, it is not the best tool for the job. And no, you cannot do it in "portions" (parts of single pages). What you can do: extract the text of a certain range of pages only. First: Ghostscript's txtwrite output device (not so good): gs -dBATCH -dNOPAUSE -sDEVICE=txtwrite -dFirstPage=3 -dLastPage=5 -sOutputFile=- /path/to/your/pdf This will output all text contained on pages 3-5 to stdout. If you
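
Here is a minimal Python sketch of the same idea, wrapping that exact Ghostscript invocation with subprocess. It assumes gs is on the PATH, adds -q so the banner does not end up in the captured text, and uses a placeholder PDF path.

```python
# Run the Ghostscript txtwrite device from Python and capture the text of
# pages 3-5. Assumes `gs` is installed and on PATH; the PDF path is a
# placeholder.
import subprocess

def extract_pages_text(pdf_path, first=3, last=5):
    cmd = [
        "gs", "-q", "-dBATCH", "-dNOPAUSE",
        "-sDEVICE=txtwrite",
        f"-dFirstPage={first}", f"-dLastPage={last}",
        "-sOutputFile=-",          # write extracted text to stdout
        pdf_path,
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout

print(extract_pages_text("/path/to/your/pdf"))
```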

How to extract common / significant phrases from a series of text entries

断了今生、忘了曾经 submitted on 2019-11-26 23:50:10
Question: I have a series of text items: raw HTML from a MySQL database. I want to find the most common phrases in these entries (not the single most common phrase, and ideally not enforcing word-for-word matching). My example is any review on Yelp.com, which shows 3 snippets from hundreds of reviews of a given restaurant, in the format: "Try the hamburger" (in 44 reviews), e.g. the "Review Highlights" section of this page: http://www.yelp.com/biz/sushi-gen-los-angeles/ I have NLTK installed and I've
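
A rough sketch of one way to approach this (not taken from the question itself): count repeated 2-4 word n-grams across the entries with nltk.util.ngrams and collections.Counter. The crude tokenizer and the sample reviews list are stand-ins; HTML stripping and fuzzy phrase grouping are left out.

```python
# Count the most frequent 2-4 word phrases across a list of plain-text
# reviews. Assumes NLTK is installed and the HTML has already been stripped.
import re
from collections import Counter
from nltk.util import ngrams

def common_phrases(reviews, n_values=(2, 3, 4), top=10):
    counts = Counter()
    for text in reviews:
        tokens = re.findall(r"[a-z']+", text.lower())   # crude tokenizer
        for n in n_values:
            counts.update(" ".join(gram) for gram in ngrams(tokens, n))
    return counts.most_common(top)

reviews = ["Try the hamburger, it is great.", "You should try the hamburger."]
print(common_phrases(reviews))
```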

Extract all email addresses from bulk text using jquery

大城市里の小女人 submitted on 2019-11-26 22:08:10
I have the text below: sdabhikagathara@rediffmail.com, "assdsdf" <dsfassdfhsdfarkal@gmail.com>, "rodnsdfald ferdfnson" <rfernsdfson@gmail.com>, "Affdmdol Gondfgale" <gyfanamosl@gmail.com>, "truform techno" <pidfpinfg@truformdftechnoproducts.com>, "NiTsdfeSh ThIdfsKaRe" <nthfsskare@ysahoo.in>, "akasdfsh kasdfstla" <akashkatsdfsa@yahsdfsfoo.in>, "Bisdsdfamal Prakaasdsh" <bimsdaalprakash@live.com>,; "milisdfsfnd ansdfasdfnsftwar" <dfdmilifsd.ensfdfcogndfdfatia@gmail.com> Here the emails are separated by , or ; . I want to extract all the emails present above and store them in an array. Is there any
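
The question asks for jQuery, but the heart of it is a single regex. As a language-agnostic illustration, here is a small Python sketch; the pattern is a pragmatic approximation, not a full RFC 5322 validator.

```python
# Pull every email-like token out of a blob of text with one regex.
import re

text = '''sdabhikagathara@rediffmail.com, "assdsdf" <dsfassdfhsdfarkal@gmail.com>,
"rodnsdfald ferdfnson" <rfernsdfson@gmail.com>'''

emails = re.findall(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", text)
print(emails)   # ['sdabhikagathara@rediffmail.com', 'dsfassdfhsdfarkal@gmail.com', ...]
```

The same pattern works in JavaScript as text.match(/[\w.+-]+@[\w-]+(?:\.[\w-]+)+/g), which is what a jQuery-based page would use.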

Getting URL parameter in java and extract a specific text from that URL

泪湿孤枕 submitted on 2019-11-26 21:00:44
I have a URL and I need to get the value of v from it. Here is my URL: http://www.youtube.com/watch?v=_RCIP6OrQrE Any useful and fruitful help is highly appreciated. I think one of the easiest ways out would be to parse the string returned by URL.getQuery() as public static Map<String, String> getQueryMap(String query) { String[] params = query.split("&"); Map<String, String> map = new HashMap<String, String>(); for (String param : params) { String name = param.split("=")[0]; String value = param.split("=")[1]; map.put(name, value); } return map; } You can use the map returned by
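
For comparison, a minimal Python sketch of the same task using the standard library's urllib.parse, which does the splitting that the Java snippet above performs by hand; the URL is the one from the question.

```python
# Grab the `v` parameter from the YouTube URL without splitting the query
# string manually.
from urllib.parse import urlparse, parse_qs

url = "http://www.youtube.com/watch?v=_RCIP6OrQrE"
params = parse_qs(urlparse(url).query)   # {'v': ['_RCIP6OrQrE']}
video_id = params.get("v", [None])[0]
print(video_id)                          # _RCIP6OrQrE
```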

How to extract string following a pattern with grep, regex or perl

限于喜欢 submitted on 2019-11-26 19:29:18
I have a file that looks something like this: <table name="content_analyzer" primary-key="id"> <type="global" /> </table> <table name="content_analyzer2" primary-key="id"> <type="global" /> </table> <table name="content_analyzer_items" primary-key="id"> <type="global" /> </table> I need to extract anything within the quotes that follow name=, i.e., content_analyzer, content_analyzer2 and content_analyzer_items. I am doing this on a Linux box, so a solution using sed, perl, grep or bash is fine. Since you need to match content without including it in the result (must match name=" but it's
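
A small Python sketch of the capturing-group approach (one option alongside the sed/perl/grep solutions the question allows); the sample XML is abbreviated from the question.

```python
# Capture only the quoted value that follows name=, using a group so the
# name=" prefix is matched but not returned.
import re

xml = '''<table name="content_analyzer" primary-key="id">
<table name="content_analyzer2" primary-key="id">
<table name="content_analyzer_items" primary-key="id">'''

print(re.findall(r'name="([^"]+)"', xml))
# ['content_analyzer', 'content_analyzer2', 'content_analyzer_items']
```

If your grep supports PCRE, grep -oP 'name="\K[^"]+' file does the same thing, with \K dropping the name=" prefix from the reported match.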

Advanced PDF Parsing Using Python (extracting text without tables, etc): What's the Best Library? [closed]

試著忘記壹切 submitted on 2019-11-26 18:44:36
Question: I'm looking for a PDF library which will allow me to extract the text from a PDF document. I've looked at PyPDF, and it can extract the text from a PDF document very nicely. The problem is that if there are tables in the document, the text in the tables is extracted in-line with the rest of the document text. This can be problematic because it produces sections of text that aren't useful and look garbled (for instance, lots of numbers mashed together). I'm looking for something
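
For reference, a minimal sketch of the baseline the question describes, using the modern pypdf package (the successor to PyPDF); the filename is a placeholder. Filtering out table text would need a layout-aware approach on top of this, which is exactly what the question is asking for.

```python
# Plain text extraction with pypdf. Table text comes out inline with
# everything else, which is the problem described in the question.
from pypdf import PdfReader

reader = PdfReader("document.pdf")       # placeholder filename
text = "\n".join(page.extract_text() or "" for page in reader.pages)
print(text[:500])
```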

How to extract just plain text from .doc & .docx files? [closed]

微笑、不失礼 submitted on 2019-11-26 17:29:59
Question: Does anyone know of anything they can recommend in order to extract just the plain text from a .doc or .docx? I've found this - wondered if there were any other suggestions? Answer 1: If you want the pure plain text (my requirement), then all you need is unzip -p some.docx word/document.xml | sed -e 's/<[^>]\{1,\}>//g; s/[^[:print:]]\{1,\}//g' which I found at Command Line Fu. It unzips the docx file, gets the actual document, then strips all the XML tags. Obviously all formatting is lost. Answer 2:
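
A small Python sketch of the same approach as the unzip | sed one-liner, assuming the input is a .docx (a zip of XML) rather than a legacy binary .doc; like the shell version, all formatting is lost.

```python
# Open the .docx as a zip archive, read word/document.xml, and strip the XML
# tags. Works only for .docx, not the old binary .doc format.
import re
import zipfile

def docx_to_text(path):
    with zipfile.ZipFile(path) as z:
        xml = z.read("word/document.xml").decode("utf-8")
    return re.sub(r"<[^>]+>", " ", xml)   # crude tag stripping

print(docx_to_text("some.docx"))
```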

How to extract text from MS office documents in C#

巧了我就是萌 submitted on 2019-11-26 17:28:43
I was trying to extract text (a string) from MS Word (.doc, .docx), Excel and PowerPoint files using C#. Where can I find a free and simple .NET library to read MS Office documents? I tried NPOI but I couldn't find a sample of how to use it. Using P/Invoke you can use the IFilter interface (on Windows). The IFilters for many common file types are installed with Windows (you can browse them using this tool). You can just ask the IFilter to return you the text from the file. There are several sets of example code (here is one such example). For Microsoft Word 2007 and Microsoft Word 2010 (
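
The question is about C#/.NET (IFilter, NPOI), which a short sketch here cannot reproduce; as a language-agnostic illustration of why the 2007+ formats are straightforward, here is a Python sketch that reads the standard OOXML text parts of .docx, .xlsx and .pptx files. The legacy binary .doc/.xls/.ppt formats are where IFilter or an NPOI-style library is really needed. The filename is a placeholder.

```python
# Crude text dump for the zip-based Office formats (.docx, .xlsx, .pptx).
import re
import zipfile

def office_text(path):
    with zipfile.ZipFile(path) as z:
        names = z.namelist()
        if path.endswith(".docx"):
            parts = ["word/document.xml"]
        elif path.endswith(".xlsx"):
            # shared strings hold most cell text; the part may be absent
            parts = [n for n in names if n == "xl/sharedStrings.xml"]
        elif path.endswith(".pptx"):
            parts = [n for n in names if re.match(r"ppt/slides/slide\d+\.xml$", n)]
        else:
            raise ValueError("only the zip-based 2007+ formats are handled here")
        xml = " ".join(z.read(p).decode("utf-8") for p in parts)
    return re.sub(r"<[^>]+>", " ", xml)   # crude tag stripping

print(office_text("report.docx"))   # placeholder filename
```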

Extracting whole words

时间秒杀一切 submitted on 2019-11-26 16:55:17
Question: I have a large set of real-world text that I need to pull words out of to feed into a spell checker. I'd like to extract as many meaningful words as possible without too much noise. I know there are plenty of regex ninjas around here, so hopefully someone can help me out. Currently I'm extracting all alphabetical sequences with '[a-z]+'. This is an okay approximation, but it drags a lot of rubbish out with it. Ideally I would like some regex (it doesn't have to be pretty or efficient) that
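
One possible refinement of the '[a-z]+' approach, sketched in Python: keep alphabetic runs with internal apostrophes or hyphens, then drop very short tokens and tokens without vowels. The length threshold and vowel heuristic are arbitrary assumptions, not part of the question.

```python
# Extract word candidates: allow internal apostrophes/hyphens (don't, re-use)
# and discard very short tokens and vowel-less tokens, which are usually noise.
import re

WORD = re.compile(r"[A-Za-z]+(?:['-][A-Za-z]+)*")

def candidate_words(text, min_len=3):
    words = []
    for w in WORD.findall(text):
        if len(w) >= min_len and re.search(r"[aeiouyAEIOUY]", w):
            words.append(w.lower())
    return words

print(candidate_words("It's a real-world test, xkcd3 http://example.com qwrt"))
# ["it's", 'real-world', 'test', 'example', 'com']
```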

regular expression to extract text from HTML

折月煮酒 submitted on 2019-11-26 16:11:40
I would like to extract all the text (displayed or not) from a general HTML page. I would like to remove any HTML tags, any JavaScript and any CSS styles. Is there a regular expression (one or more) that will achieve that? You can't really parse HTML with regular expressions. It's too complex. REs won't handle <![CDATA[ sections correctly at all. Further, some common HTML constructs like <text> will work in a browser as proper text, but might baffle a naive RE. You'll be happier and more successful with a proper HTML parser. Python folks often use something like Beautiful Soup to parse HTML and
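
A minimal sketch of the parser-based route the answer recommends, assuming Beautiful Soup (bs4) is installed: drop the <script> and <style> subtrees, then take the text of what remains.

```python
# Parse the HTML with Beautiful Soup, remove JS and CSS, and return the
# remaining visible text.
from bs4 import BeautifulSoup

def html_to_text(html):
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):
        tag.decompose()                      # remove JS and CSS entirely
    return soup.get_text(separator=" ", strip=True)

print(html_to_text("<p>Hello <b>world</b><script>alert(1)</script></p>"))
# Hello world
```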