text-extraction

How to extract text with iTextSharp 4.1.6?

别等时光非礼了梦想. Posted on 2021-02-18 21:55:12
Question: iTextSharp 4.1.6 is the last version licensed under the LGPL and is free to use for commercial purposes without paying license fees. It might be interesting for some, and for me, to know how to extract text with this version. Does anyone have an idea? Answer 1: I had to hack this together manually as I was in the same boat as you. Hopefully this will help. It's probably not perfect, but I was able to get the text I needed out of the document this way. fileName is a string variable/parameter holding the path to the PDF file. var

Regular expression (extract key/value pairs)

半腔热情 Posted on 2021-01-28 18:33:31
Question: I'm trying to extract a list (match) of key/value pairs from a string. Example: PATH_1:"/", PATH_2:"/OtherPath", TODAY:"2016-06-27",XYZ :"1234" This should give: Key Value PATH_1 / PATH_2 /OtherPath TODAY 2016-06-27 XYZ 1234 Here is the regex I have so far: ((?:"[^"]*"|[^:,])*):((?:"[^"]*"|[^:,])*) This works well except when I add a path containing a '\'. Example: PATH_1:"c:\", PATH_2:"c:\OtherPath", TODAY:"2016-06-27" I don't know how to instruct the regular expression to jump over
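A minimal sketch of one way to handle the inputs quoted in the question, written in Python. It assumes every value is wrapped in double quotes and never contains an embedded double quote, so a backslash inside a value is ordinary text. The names PAIR and parse_pairs are made up for this illustration and do not come from the original post.

```python
import re

# Key: one or more word characters. Value: everything between the next pair
# of double quotes. Assumption: values are always quoted and contain no '"'.
PAIR = re.compile(r'(\w+)\s*:\s*"([^"]*)"')


def parse_pairs(text):
    """Return a dict mapping each key to its unquoted value."""
    return dict(PAIR.findall(text))


if __name__ == "__main__":
    sample = r'PATH_1:"c:\", PATH_2:"c:\OtherPath", TODAY:"2016-06-27"'
    for key, value in parse_pairs(sample).items():
        print(key, value)
```

Because the value is matched only inside the quotes, the colon in c:\ and the trailing backslash need no special treatment. If values could contain escaped quotes, the pattern would need to change.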

Extracting text from a rectangle using iText ( .Net ) does give me the entire line

馋奶兔 Posted on 2020-12-06 19:21:37
Question: The following is the code (using iText for .NET, version 7.0.4.0) that I am using to extract the text from a PDF. What I have observed during my testing is that it works well, extracting only the content within a rectangle, for most of the PDFs. But for a few of them it gives the entire line from the PDF. I know that iText returns the text snippets that intersect with the rectangle (so part of the text may be outside the rectangle; iText doesn't cut text snippets into pieces). But I want to understand what parameter in the
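The behaviour described in the question can be pictured with a toy model in Python. This is not iText's API: Box, Chunk and text_in_region are hypothetical names, and the coordinates are invented. It only illustrates the stated rule that a snippet is kept whole whenever its bounding box intersects the rectangle, so the result depends on how the PDF splits a line into snippets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    x0: float
    y0: float
    x1: float
    y1: float

    def intersects(self, other):
        # True when the two rectangles share some area.
        return (self.x0 < other.x1 and other.x0 < self.x1
                and self.y0 < other.y1 and other.y0 < self.y1)


@dataclass(frozen=True)
class Chunk:
    text: str
    box: Box


def text_in_region(chunks, region):
    # Keep every chunk that touches the region; chunks are never cut.
    return " ".join(c.text for c in chunks if c.box.intersects(region))


region = Box(0, 0, 100, 20)
# Same visible line, stored as two snippets in one file and one in another.
per_word = [Chunk("Invoice", Box(10, 5, 60, 15)),
            Chunk("total", Box(110, 5, 150, 15))]
whole_line = [Chunk("Invoice total", Box(10, 5, 150, 15))]

print(text_in_region(per_word, region))
print(text_in_region(whole_line, region))
```

Under this model, a file whose lines are written as one long snippet returns the whole line, while a file that writes each word separately returns only the words touching the rectangle.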

How to use the Amazon Textract with PDF files

可紊 Posted on 2020-08-10 08:42:52
Question: I can already use Textract with JPEG files, and I would like to use it with PDF files. I have the code below:

    import boto3

    # Document
    documentName = "Path to document in JPEG"

    # Read document content
    with open(documentName, 'rb') as document:
        imageBytes = bytearray(document.read())

    # Amazon Textract client
    textract = boto3.client('textract')
    documentText = ""

    # Call Amazon Textract
    response = textract.detect_document_text(Document={'Bytes': imageBytes})
    #print(response)

    # Print detected
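One route for PDFs, including multi-page ones, is Textract's asynchronous API, which reads the document from S3 instead of taking raw bytes. The sketch below assumes the PDF has already been uploaded to a bucket. The function name extract_pdf_lines and its parameters are made up for this illustration, and the client argument is expected to behave like boto3.client('textract').

```python
import time


def extract_pdf_lines(client, bucket, key, poll_seconds=5.0, max_polls=120):
    """Return the detected text lines of a PDF stored in S3.

    Assumption: `client` behaves like boto3.client('textract').
    """
    job = client.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    job_id = job["JobId"]

    # Poll until the job leaves the IN_PROGRESS state.
    for _ in range(max_polls):
        result = client.get_document_text_detection(JobId=job_id)
        if result["JobStatus"] != "IN_PROGRESS":
            break
        time.sleep(poll_seconds)
    else:
        raise TimeoutError(f"Textract job {job_id} did not finish in time")

    if result["JobStatus"] != "SUCCEEDED":
        raise RuntimeError(f"Textract job {job_id} ended as {result['JobStatus']}")

    # Collect LINE blocks, following NextToken across result pages.
    lines = []
    while True:
        lines.extend(
            block["Text"]
            for block in result["Blocks"]
            if block["BlockType"] == "LINE"
        )
        token = result.get("NextToken")
        if not token:
            return lines
        result = client.get_document_text_detection(JobId=job_id, NextToken=token)


# Usage (hypothetical bucket and key; needs AWS credentials and boto3):
#   import boto3
#   lines = extract_pdf_lines(boto3.client("textract"), "my-bucket", "doc.pdf")
```

Polling in a loop keeps the sketch short. For production use, Textract can instead publish a completion notification to an SNS topic.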

Excel VBA to return Page Count from protected PDF file

微笑、不失礼 Posted on 2020-07-24 05:48:53
Question: I need to retrieve the number of pages in PDF files (with security enabled), using Excel VBA. The following code works when there is no security enabled in the PDF file:

    Sub PDFandNumPages()
        Dim Folder As Object
        Dim file As Object
        Dim fso As Object
        Dim iExtLen As Integer, iRow As Integer
        Dim sFolder As String, sExt As String
        Dim sPDFName As String

        sExt = "pdf"
        iExtLen = Len(sExt)
        iRow = 1
        ' Must have a '\' at the end of path
        sFolder = "C:\test\"
        Set fso = CreateObject("Scripting.FileSystemObject")