extraction

Extract Images and Words with coordinates and sizes from PDF

回眸只為那壹抹淺笑 submitted on 2019-11-30 13:39:29
I've read a lot about PDF extraction and libraries (such as iText), but I haven't found a way to extract images and text, with coordinates, from a PDF. The task is to scan a PDF product catalog and extract each image. An image code is printed next to each image, along with a list of product codes for the products shown in that image. I know there is no way to extract structured information from a PDF like this, but with the coordinates of all image and text objects I could write code to identify the linked text by its distance from the image. Then I could split text using a RegExp and

Extract the text in an element with jQuery

南楼画角 submitted on 2019-11-30 09:12:18
I want to extract the text inside an element with jQuery:

    <div id="bla"> <span><strong>bla bla bla</strong>I want this text</span> </div>

I want only the text "I want this text", without the strong tag. How can I do that?

Try this...

    <script type="text/javascript">
    //<![CDATA[
    $(document).ready(function(){
        $("#bla span").contents().each(function(i) {
            if(this.nodeName == "#text") alert(this.textContent);
        });
    });
    //]]>
    </script>

This doesn't need to remove any other nodes from context, and will just give you the text node(s) on each iteration.

This does it (tested):

    var clone = $("#bla > span")

Java: Apache POI: Can I get clean text from MS Word (.doc) files?

自古美人都是妖i submitted on 2019-11-30 08:48:49
The strings I get programmatically from MS Word files with Apache POI are not the same text I see when I open the files in MS Word. With the following code:

    File someFile = new File("some\\path\\MSWFile.doc");
    InputStream inputStrm = new FileInputStream(someFile);
    HWPFDocument wordDoc = new HWPFDocument(inputStrm);
    System.out.println(wordDoc.getText());

the output is a single line with many 'invalid' characters (yes, the 'boxes') and many unwanted strings, like " FORMTEXT ", " HYPERLINK \l "_Toc##########" " ('#' being numeric digits), " PAGEREF _Toc########## \h

OpenCV: How to get inlier points using findHomography()/findFundamental() and RANSAC

为君一笑 submitted on 2019-11-30 06:56:18
OpenCV does not provide a RANSAC function per se, or at least not in a form that you can just call and be done with (e.g. cv::ransac(...)). All functions/methods that can use RANSAC have a flag that enables it. However, this is not always useful if you want to do something else with the inliers RANSAC computes after you have estimated a homography or fundamental matrix, for example create a nice plot of the points in Octave or a similar software/library, apply additional algorithms to the remaining set of filtered matches, etc. After matching two images one gets a vector of

Extracting a Zip to the SD card is very slow. How can I optimize performance?

痞子三分冷 submitted on 2019-11-30 04:58:06
My app downloads a zip with about 350 files, a mix of JPG and HTML files. The function I wrote to extract it works just fine, but the unzipping takes forever. At first I thought the reason might be that writing to the SD card is slow, but when I unzip the same zip with another app on my phone it is much faster. Is there anything I could do to optimize it? Here is the code:

    private void extract() {
        try {
            FileInputStream inStream = new FileInputStream(targetFilePath);
            ZipInputStream zipStream = new ZipInputStream(new BufferedInputStream(inStream));
            ZipEntry entry;
            ZipFile zip = new ZipFile
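
The code in the question is cut off, so the actual bottleneck cannot be seen here. A frequent cause of slow unzipping is copying each entry one byte at a time, or writing through an unbuffered output stream. Below is a minimal sketch of a buffered extraction loop in plain Java, assuming that is the issue. The names `FastUnzip` and `extract(File, File)` and the 8 KB buffer size are illustrative choices, not taken from the original post.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper: name, signature and buffer size are assumptions.
final class FastUnzip {
    private static final int BUFFER_SIZE = 8192;

    static void extract(File zipFile, File targetDir) throws IOException {
        byte[] buffer = new byte[BUFFER_SIZE]; // one buffer reused for every entry
        String root = targetDir.getCanonicalPath() + File.separator;

        try (ZipInputStream zin = new ZipInputStream(
                new BufferedInputStream(new FileInputStream(zipFile), BUFFER_SIZE))) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                File out = new File(targetDir, entry.getName());

                // Refuse entries such as "../x" that would land outside targetDir.
                if (!out.getCanonicalPath().startsWith(root)) {
                    throw new IOException("Entry outside target dir: " + entry.getName());
                }
                if (entry.isDirectory()) {
                    out.mkdirs();
                    continue;
                }
                File parent = out.getParentFile();
                if (parent != null) {
                    parent.mkdirs();
                }
                // Copy in blocks through a buffered stream instead of byte by byte.
                try (OutputStream os = new BufferedOutputStream(
                        new FileOutputStream(out), BUFFER_SIZE)) {
                    int n;
                    while ((n = zin.read(buffer)) != -1) {
                        os.write(buffer, 0, n);
                    }
                }
            }
        }
    }
}
```

A call might look like `FastUnzip.extract(new File(targetFilePath), new File(outputDir))`, where `outputDir` is a placeholder for the destination folder. Whether this helps depends on what the truncated original loop actually does.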

Java - Regex extract date from string

喜你入骨 submitted on 2019-11-29 23:43:28
Question: I need to extract the date from this string:

    BB inform: buy your tickect, final card number xxxx, $ 00,00, on 04/10, at 11:28. If you don't recognize call 40032 2412.

The date may also appear in full: 04/10/2015. The date pattern is dd/MM or dd/MM/yyyy. The code:

    String mydata = "BB inform: buy your tickect, final card number xxxx, $ 00,00, on 04/10, at 11:28. If you don't recognize call 40032 2412.";
    Pattern p = Pattern.compile("(0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])[- /.](19|20)\\d\\d");
    Matcher m = p
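
The pattern in the question puts the month range (01 to 12) before the day range (01 to 31) and always requires a year, so it cannot match a bare dd/MM such as 04/10. One possible adjustment, as a minimal sketch: swap the day and month groups and make the year optional. The class name `DateExtractor`, the method `extractDates`, and the restriction to `/` as the only separator are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: names are illustrative, not from the original post.
final class DateExtractor {
    // dd/MM with an optional /yyyy (years 1900-2099), bounded by word boundaries.
    private static final Pattern DATE = Pattern.compile(
            "\\b(0[1-9]|[12][0-9]|3[01])/(0[1-9]|1[012])(?:/((?:19|20)\\d\\d))?\\b");

    // Returns every dd/MM or dd/MM/yyyy occurrence, in order of appearance.
    static List<String> extractDates(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = DATE.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String mydata = "BB inform: buy your tickect, final card number xxxx, "
                + "$ 00,00, on 04/10, at 11:28. If you don't recognize call 40032 2412.";
        System.out.println(extractDates(mydata));
    }
}
```

The pattern only checks the shape of the date, so combinations such as 31/02 still match. If calendar validity matters, the matched text would need to be parsed afterwards.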

Extract Images and Words with coordinates and sizes from PDF

拥有回忆 submitted on 2019-11-29 19:31:57
Question: I've read a lot about PDF extraction and libraries (such as iText), but I haven't found a way to extract images and text, with coordinates, from a PDF. The task is to scan a PDF product catalog and extract each image. An image code is printed next to each image, along with a list of product codes for the products shown in that image. I know there is no way to extract structured information from a PDF like this, but with the coordinates of all image and text objects I could write

Extract text from an XPS Document [closed]

偶尔善良 submitted on 2019-11-29 18:10:53
I need to extract the text of a specific page from an XPS document. The extracted text should be written to a string. I need this to read out the extracted text using Microsoft's SpeechLib. Examples in C# only, please. Thanks, Sanjay

Add references to ReachFramework and WindowsBase, and the following using statement:

    using System.Windows.Xps.Packaging;

Then use this code:

    XpsDocument _xpsDocument = new XpsDocument("/path", System.IO.FileAccess.Read);
    IXpsFixedDocumentSequenceReader fixedDocSeqReader = _xpsDocument.FixedDocumentSequenceReader;
    IXpsFixedDocumentReader _document = fixedDocSeqReader

Text extraction with Java HTML parsers

折月煮酒 submitted on 2019-11-29 17:59:19
I want to use an HTML parser that does the following in a nice, elegant way: extract text (this is most important); extract links and meta keywords; reconstruct the original document (optional, but a nice feature to have). From my investigation so far, Jericho seems to fit. Are there any other open source libraries you would recommend?

I recently experimented with HtmlCleaner and CyberNekoHtml. CyberNekoHtml is a DOM/SAX parser that produces predictable results. HtmlCleaner is a tad faster, but quite often fails to produce accurate results. I would recommend CyberNekoHtml. CyberNekoHtml can do all of the things you

Extract the text in an element with jQuery

孤人 submitted on 2019-11-29 13:25:33
Question: I want to extract the text inside an element with jQuery:

    <div id="bla"> <span><strong>bla bla bla</strong>I want this text</span> </div>

I want only the text "I want this text", without the strong tag. How can I do that?

Answer 1: Try this...

    <script type="text/javascript">
    //<![CDATA[
    $(document).ready(function(){
        $("#bla span").contents().each(function(i) {
            if(this.nodeName == "#text") alert(this.textContent);
        });
    });
    //]]>
    </script>

This doesn't need to remove any other nodes from context, and