Extract Data from PDF and Add to Worksheet

后端 未结 8 2051
情深已故
情深已故 2020-12-01 01:35

I am trying to extract the data from a PDF document into a worksheet. The PDFs show and text can be manually copied and pasted into the Excel document.

I am currentl

相关标签:
8条回答
  • 2020-12-01 02:07

    Over time, I have found that extracting text from PDFs in a structured format is tough business. However if you are looking for an easy solution, you might want to consider XPDF tool pdftotext.

    Pseudocode to extract the text would include:

    1. Using SHELL VBA statement to extract the text from PDF to a temporary file using XPDF
    2. Using sequential file read statements to read the temporary file contents into a string
    3. Pasting the string into Excel

    Simplified example below:

        Sub ReadIntoExcel(PDFName As String)
            'Convert PDF to text
            Shell "C:\Utils\pdftotext.exe -layout " & PDFName & " tempfile.txt"
    
            'Read in the text file and write to Excel
            Dim TextLine as String
            Dim RowNumber as Integer
            Dim F1 as Integer
            RowNumber = 1
            F1 = Freefile()
            Open "tempfile.txt" for Input as #F1
                While Not EOF(#F1)
                    Line Input #F1, TextLine
                    ThisWorkbook.WorkSheets(1).Cells(RowNumber, 1).Value = TextLine
                    RowNumber = RowNumber + 1
                Wend
            Close #F1
        End Sub
    
    0 讨论(0)
  • 2020-12-01 02:08

    Since I do not prefer to rely on external libraries and/or other programs, I have extended your solution so that it works. The actual change here is using the GetFromClipboard function instead of Paste which is mainly used to paste a range of cells. Of course, the downside is that the user must not change focus or intervene during the whole process.

    Dim pathPDF As String, textPDF As String
    Dim openPDF As Object
    Dim objPDF As MsForms.DataObject
    
    pathPDF = "C:\some\path\data.pdf"
    Set openPDF = CreateObject("Shell.Application")
    openPDF.Open (pathPDF)
    'TIME TO WAIT BEFORE/AFTER COPY AND PASTE SENDKEYS
    Application.Wait Now + TimeValue("00:00:2")
    SendKeys "^a"
    Application.Wait Now + TimeValue("00:00:2")
    SendKeys "^c"
    Application.Wait Now + TimeValue("00:00:1")
    
    AppActivate ActiveWorkbook.Windows(1).Caption
    objPDF.GetFromClipboard
    textPDF = objPDF.GetText(1)
    MsgBox textPDF
    

    If you're interested see my project in github.

    0 讨论(0)
  • 2020-12-01 02:11

    This doesn't seem to work with the Adobe Type library. As soon as it gets to Open, I get a 429 error. Acrobat works fine though...

    0 讨论(0)
  • 2020-12-01 02:12

    I know this is an old issue but I just had to do this for a project at work, and I am very surprised that nobody has thought of this solution yet: Just open the .pdf with Microsoft word.

    The code is a lot easier to work with when you are trying to extract data from a .docx because it opens in Microsoft Word. Excel and Word play well together because they are both Microsoft programs. In my case, the file of question had to be a .pdf file. Here's the solution I came up with:

    1. Choose the default program to open .pdf files to be Microsoft Word
    2. The first time you open a .pdf file with word, a dialogue box pops up claiming word will need to convert the .pdf into a .docx file. Click the check box in the bottom left stating "do not show this message again" and then click OK.
    3. Create a macro that extracts data from a .docx file. I used MikeD's Code as a resource for this.
    4. Tinker around with the MoveDown, MoveRight, and Find.Execute methods to fit the need of your task.

    Yes you could just convert the .pdf file to a .docx file but this is a much simpler solution in my opinion.

    0 讨论(0)
  • 2020-12-01 02:12

    Copying and pasting by user interactions emulation could be not reliable (for example, popup appears and it switches the focus). You may be interested in trying the commercial ByteScout PDF Extractor SDK that is specifically designed to extract data from PDF and it works from VBA. It is also capable of extracting data from invoices and tables as CSV using VB code.

    Here is the VBA code for Excel to extract text from given locations and save them into cells in the Sheet1:

    Private Sub CommandButton1_Click()
    
    ' Create TextExtractor object
    ' Set extractor = CreateObject("Bytescout.PDFExtractor.TextExtractor")
    Dim extractor As New Bytescout_PDFExtractor.TextExtractor
    
    extractor.RegistrationName = "demo"
    extractor.RegistrationKey = "demo"
    
    ' Load sample PDF document
    extractor.LoadDocumentFromFile ("c:\sample1.pdf")
    
    ' Get page count
    pageCount = extractor.GetPageCount()
    
    Dim wb As Workbook
    Dim ws As Worksheet
    Dim TxtRng  As Range
    
    Set wb = ActiveWorkbook
    Set ws = wb.Sheets("Sheet1")
    
    For i = 0 To pageCount - 1
                RectLeft = 10
                RectTop = 10
                RectWidth = 100
                RectHeight = 100
    
                ' check the same text is extracted from returned coordinates
                extractor.SetExtractionArea RectLeft, RectTop, RectWidth, RectHeight
                ' extract text from given area
                extractedText = extractor.GetTextFromPage(i)
    
                ' insert rows
                ' Rows(1).Insert shift:=xlShiftDown
                ' write cell value
                 Set TxtRng = ws.Range("A" & CStr(i + 2))
                 TxtRng.Value = extractedText
    
    Next
    
    Set extractor = Nothing
    
    
    End Sub
    

    Disclosure: I am related to ByteScout

    0 讨论(0)
  • 2020-12-01 02:22

    Using Bytescout PDF Extractor SDK is a good option. It is cheap and gives plenty of PDF related functionality. One of the answers above points to the dead page Bytescout on GitHub. I am providing a relevant working sample to extract table from PDF. You may use it to export in any format.

    Set extractor = CreateObject("Bytescout.PDFExtractor.StructuredExtractor")
    
    extractor.RegistrationName = "demo"
    extractor.RegistrationKey = "demo"
    
    ' Load sample PDF document
    extractor.LoadDocumentFromFile "../../sample3.pdf"
    
    For ipage = 0 To extractor.GetPageCount() - 1 
    
        ' starting extraction from page #"
        extractor.PrepareStructure ipage
    
        rowCount = extractor.GetRowCount(ipage)
    
        For row = 0 To rowCount - 1 
            columnCount = extractor.GetColumnCount(ipage, row)
    
            For col = 0 To columnCount-1
                WScript.Echo "Cell at page #" +CStr(ipage) + ", row=" & CStr(row) & ", column=" & _
                    CStr(col) & vbCRLF & extractor.GetCellValue(ipage, row, col)
            Next
        Next
    Next
    

    Many more samples available here: https://github.com/bytescout/pdf-extractor-sdk-samples

    0 讨论(0)
提交回复
热议问题