text-extraction

Extracting information from a captured image in Android

寵の児 submitted on 2019-12-21 05:14:14
Question: This is my image: I used this link (Tesseract) to capture and process the image: http://kurup87.blogspot.com/2012/03/android-ocr-tutorial-image-to-text.html The issue is that if the entire area is scanned, the returned values are garbage, not accurate. But if I scan V516990, 2653, and the date separately, the results are correct. My intention is to scan V516990 and 2653 in one go, without the user having to use the camera twice. Any comments are welcome! Answer 1: Let the user take one

How can I read a PDF in Python? [duplicate]

对着背影说爱祢 submitted on 2019-12-20 10:34:51
Question: This question already has answers here: How to extract text from a PDF file? (17 answers). Closed 2 years ago. How can I read a PDF in Python? I know one way, converting it to text, but I want to read the content directly from the PDF. Can anyone explain which Python module is best for PDF extraction? Answer 1: You can use the PyPDF2 package:
    # install PyPDF2
    pip install PyPDF2
    # importing all the required modules
    import PyPDF2
    # creating an object
    file = open('example.pdf', 'rb')
    # creating a pdf reader

Awk doesn't match all my entries

核能气质少年 submitted on 2019-12-20 02:53:02
Question: I'm trying to write "a script" (essentially an awk command) that extracts the function prototypes from C code in a .c file in order to generate a .h header automatically. I'm new to awk, so I don't get all the details. This is a sample of the source .c:
    dict_t dictup(dict_t d, const char * key, const char * newval)
    {
        int i = dictlook(d, key);
        if (i == DICT_NOT_FOUND) {
            fprintf(stderr, "key \"%s\" doesn't exist.\n", key);
            dictdump(d);
        } else {
            strncpy(d.entry[i].val, newval, DICTENT_VALLENGTH);
        }
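As a rough illustration of the task described in this question (not the asker's awk command, which is not shown here), the same idea can be sketched in Python with a regular expression. The names `PROTO_RE` and `extract_prototypes` are hypothetical, and the sketch assumes that function definitions start in column 0, that parameter lists contain no nested parentheses (so no function-pointer parameters), and that `static` functions should stay out of the header.

```python
import re

# Hypothetical pattern: return type, function name, parameter list, then "{"
# on the same line or a following one. Assumes the definition starts in
# column 0 and the parameter list contains no nested parentheses.
PROTO_RE = re.compile(
    r"^(?P<sig>[A-Za-z_][\w \t\*]*?[ \t\*]+[A-Za-z_]\w*[ \t]*\([^;{}()]*\))\s*\{",
    re.MULTILINE,
)


def extract_prototypes(source):
    """Return a list of 'type name(params);' strings found in C source text."""
    prototypes = []
    for match in PROTO_RE.finditer(source):
        # Collapse line breaks and repeated whitespace inside the signature.
        signature = " ".join(match.group("sig").split())
        # Assumption: file-local (static) functions do not belong in the header.
        if signature.startswith("static "):
            continue
        prototypes.append(signature + ";")
    return prototypes


if __name__ == "__main__":
    c_source = (
        "dict_t dictup(dict_t d, const char * key, const char * newval)\n"
        "{\n"
        "    int i = dictlook(d, key);\n"
        "    return d;\n"
        "}\n"
    )
    for line in extract_prototypes(c_source):
        print(line)
```

Indented statements such as `if (...) {` are skipped only because they do not start in column 0, so heavily macro-based or unusually formatted code would need a real C parser rather than a pattern like this.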

Apache PDFBox Remove Spaces between characters

青春壹個敷衍的年華 submitted on 2019-12-19 03:14:11
Question: We are using PDFBox to extract text from PDFs. The text of some PDFs can't be extracted correctly. The following image shows part of the PDF: After text extraction we get the following text: 3, 8 5 EU R 1 Netto 38,50 EUR 4,00 (spaces are added between ',' and '8'). Here is our code:
    PDDocument pdf = PDDocument.load(reuseableInputStream);
    PDFTextStripper pdfStripper = new PDFTextStripper();
    pdfStripper.setSortByPosition(true);
    String text = pdfStripper.getText(pdf);
We tried to play with

Tag generation from small text content (such as tweets)

核能气质少年 submitted on 2019-12-17 23:13:14
Question: I asked a similar question earlier, but I have noticed that I have a big constraint: I am working on small text sets, such as user tweets, to generate tags (keywords). It seems that the accepted suggestion (the pointwise mutual information algorithm) is meant to work on bigger documents. With this constraint (working on small sets of text), how can I generate tags? Regards. Answer 1: Two-Stage Approach for Multiword Tags. You could pool all the tweets into a single larger document and then
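The answer text is cut off at this point, so the following is only a guess at how pooling could be combined with the pointwise mutual information idea mentioned in the question. It is a minimal sketch: the function names `tokenize` and `pmi_bigrams`, the `min_count` threshold, and the simple tokenizer are all assumptions, not part of the original answer. Counts are pooled over all tweets, and each adjacent word pair is scored as log2(p(a, b) / (p(a) p(b))).

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def pmi_bigrams(tweets, min_count=2):
    """Score adjacent word pairs by PMI, using counts pooled over all tweets."""
    unigrams = Counter()
    bigrams = Counter()
    for tweet in tweets:
        tokens = tokenize(tweet)
        unigrams.update(tokens)
        # Pairs are formed inside one tweet only, never across tweet boundaries.
        bigrams.update(zip(tokens, tokens[1:]))

    total_unigrams = sum(unigrams.values())
    total_bigrams = sum(bigrams.values())
    scores = {}
    for (first, second), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unreliable PMI estimates
        p_pair = count / total_bigrams
        p_first = unigrams[first] / total_unigrams
        p_second = unigrams[second] / total_unigrams
        scores[(first, second)] = math.log2(p_pair / (p_first * p_second))
    # Highest-scoring pairs first; these are the candidate multiword tags.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    sample_tweets = [
        "machine learning is fun",
        "i love machine learning",
        "fun is fun",
    ]
    for pair, score in pmi_bigrams(sample_tweets):
        print(" ".join(pair), round(score, 2))
```

The `min_count` filter matters on small inputs: a pair seen only once can get a very high PMI by chance, which is one reason the technique is usually described as needing larger corpora.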

Extract all email addresses from bulk text using jQuery

∥☆過路亽.° submitted on 2019-12-17 05:49:27
Question: I have the text below: sdabhikagathara@rediffmail.com, "assdsdf" <dsfassdfhsdfarkal@gmail.com>, "rodnsdfald ferdfnson" <rfernsdfson@gmail.com>, "Affdmdol Gondfgale" <gyfanamosl@gmail.com>, "truform techno" <pidfpinfg@truformdftechnoproducts.com>, "NiTsdfeSh ThIdfsKaRe" <nthfsskare@ysahoo.in>, "akasdfsh kasdfstla" <akashkatsdfsa@yahsdfsfoo.in>, "Bisdsdfamal Prakaasdsh" <bimsdaalprakash@live.com>,; "milisdfsfnd ansdfasdfnsftwar" <dfdmilifsd.ensfdfcogndfdfatia@gmail.com> Here emails are
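The question asks about jQuery, but the core of the task is a pattern match, which can be sketched in a language-neutral way. The example below uses Python; `EMAIL_RE` and `extract_emails` are hypothetical names, and the pattern is a deliberately simple approximation that does not cover every address the email standards allow. A similar expression could be used with a global regex match in JavaScript.

```python
import re

# Simplified pattern: local part, "@", domain labels, and a final label of
# two or more letters. An approximation, not a full RFC 5322 matcher.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(text):
    """Return the email addresses found in text, in order, without duplicates."""
    seen = set()
    emails = []
    for candidate in EMAIL_RE.findall(text):
        key = candidate.lower()  # treat addresses as case-insensitive
        if key not in seen:
            seen.add(key)
            emails.append(candidate)
    return emails


if __name__ == "__main__":
    bulk = (
        'sdabhikagathara@rediffmail.com, "assdsdf" <dsfassdfhsdfarkal@gmail.com>,; '
        '"NiTsdfeSh ThIdfsKaRe" <nthfsskare@ysahoo.in>'
    )
    for address in extract_emails(bulk):
        print(address)
```

Because the pattern ignores the quoted display names, angle brackets, and stray separators such as `,;`, it does not depend on how consistently the list is delimited.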

PDF Text Extraction with Coordinates

房东的猫 submitted on 2019-12-17 05:36:40
Question: I would like to extract text from a portion (using coordinates) of a PDF using Ghostscript. Can anyone help me out? Answer 1: Yes, with Ghostscript, you can extract text from PDFs. But no, it is not the best tool for the job. And no, you cannot do it in "portions" (parts of single pages). What you can do: extract the text of a certain range of pages only. First: Ghostscript's txtwrite output device (not so good)
    gs \
      -dBATCH \
      -dNOPAUSE \
      -sDEVICE=txtwrite \
      -dFirstPage=3 \
      -dLastPage=5 \

Python module for converting PDF to text [closed]

雨燕双飞 submitted on 2019-12-16 19:56:28
Question: Closed. This question is off-topic. It is not currently accepting answers. Closed 4 years ago. Which are the best Python modules to convert PDF files into text? Answer 1: Try PDFMiner. It can extract text from PDF files as HTML, SGML or "Tagged PDF" format. The Tagged PDF format seems to be the cleanest, and stripping out the XML tags leaves just the bare text. A Python 3 version is available under: https:/

iText: Extracted text from pdf file using LocationTextExtractionStrategy is in wrong order

老子叫甜甜 submitted on 2019-12-14 03:45:50
Question: I am using iText to extract some text from a PDF file at a specific location. In order to do that I am using the LocationTextExtractionStrategy:
    public static void main(String[] args) throws Exception {
        PdfReader pdfReader = new PdfReader("location_text_extraction_test.pdf");
        Rectangle rectangle = new Rectangle(38, 0, 516, 516);
        RenderFilter[] filter = {new RegionTextRenderFilter(rectangle)};
        TextExtractionStrategy strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy()