Extract Data from PDF and Add to Worksheet

情深已故 2020-12-01 01:35

I am trying to extract data from a PDF document into a worksheet. The PDFs display correctly, and their text can be manually copied and pasted into the Excel document.

I am currentl

8 Answers
  • 2020-12-01 02:30

    To improve on Slinky Sloth's solution, I had to add this before getting the data from the clipboard:

    Set objPDF = New MSForms.DataObject
    

    Sadly, it didn't work for a PDF of 10 pages.
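
    For context, the `MSForms.DataObject` line would sit in a clipboard read roughly like the one below. This is only a minimal sketch, not Slinky Sloth's original code: it assumes a reference to the Microsoft Forms 2.0 Object Library and that the PDF text has already been copied to the clipboard, and the function name `GetClipboardText` is hypothetical.

    ```vba
    ' Hypothetical helper: returns whatever plain text is currently on the clipboard.
    ' Assumes a reference to "Microsoft Forms 2.0 Object Library".
    Function GetClipboardText() As String
        Dim objPDF As MSForms.DataObject

        ' Create a fresh DataObject before each read
        Set objPDF = New MSForms.DataObject
        objPDF.GetFromClipboard

        ' Format 1 is plain text; skip the read if the clipboard holds none
        If objPDF.GetFormat(1) Then
            GetClipboardText = objPDF.GetText(1)
        End If
    End Function
    ```

    If the clipboard holds no plain text, the function returns an empty string rather than raising an error.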

  • 2020-12-01 02:34

    You can open the PDF file and extract its contents using the Adobe library (which I believe you can download from Adobe as part of the SDK, but it also comes with certain versions of Acrobat).

    Make sure to add the library to your references too (on my machine it is the Adobe Acrobat 10.0 Type Library, but I'm not sure whether that is the newest version).

    Even with the Adobe library it is not trivial (you'll need to add your own error trapping, etc.):

    Function getTextFromPDF(ByVal strFilename As String) As String
       ' Requires a reference to the Adobe Acrobat Type Library
       Dim objAVDoc As New AcroAVDoc
       Dim objPDDoc As AcroPDDoc
       Dim objPage As AcroPDPage
       Dim objSelection As AcroPDTextSelect
       Dim objHighlight As AcroHiliteList
       Dim pageNum As Long
       Dim tCount As Long
       Dim strText As String
    
       strText = ""
       If objAVDoc.Open(strFilename, "") Then
          Set objPDDoc = objAVDoc.GetPDDoc
          ' Pages are numbered from 0
          For pageNum = 0 To objPDDoc.GetNumPages() - 1
             Set objPage = objPDDoc.AcquirePage(pageNum)
             ' Highlight the first 10000 text elements on the page
             Set objHighlight = New AcroHiliteList
             objHighlight.Add 0, 10000 ' Adjust this up if it's not getting all the text on the page
             Set objSelection = objPage.CreatePageHilite(objHighlight)
    
             ' Append each highlighted text element to the result
             If Not objSelection Is Nothing Then
                For tCount = 0 To objSelection.GetNumText - 1
                   strText = strText & objSelection.GetText(tCount)
                Next tCount
             End If
          Next pageNum
          objAVDoc.Close 1 ' Close without saving
       End If
    
       getTextFromPDF = strText
    
    End Function
    

    What this does is essentially the same thing you are trying to do - only using Adobe's own library. It's going through the PDF one page at a time, highlighting all of the text on the page, then dropping it (one text element at a time) into a string.
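
    Once the text is in a string, getting it onto a worksheet could look something like the sketch below. It is only one possible layout (one line of text per row in column A), and the name `WriteTextToSheet` and its parameters are hypothetical, not part of the Adobe library or the code above.

    ```vba
    ' Hypothetical helper: writes each line of strText to its own row in column A.
    Sub WriteTextToSheet(ByVal strText As String, ByVal ws As Worksheet)
        Dim arrLines() As String
        Dim i As Long

        ' Normalize CR+LF and bare CR line endings to a single line feed
        strText = Replace(strText, vbCrLf, vbLf)
        strText = Replace(strText, vbCr, vbLf)

        arrLines = Split(strText, vbLf)
        For i = LBound(arrLines) To UBound(arrLines)
            ' Rows are 1-based, the array is 0-based
            ws.Cells(i + 1, 1).Value = arrLines(i)
        Next i
    End Sub
    ```

    It could then be called as `WriteTextToSheet getTextFromPDF(strFilename), ActiveSheet`, where `strFilename` holds the path to your PDF. An empty string writes nothing, because `Split` then returns an empty array.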

    Keep in mind that what you get from this could be full of all kinds of non-printing characters (carriage returns, line feeds, etc.) that may even end up in the middle of what look like contiguous blocks of text, so you may need additional code to clean it up before you can use it.
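
    A cleanup pass along those lines might look like the sketch below. It is an assumption about what "cleaning up" means for your data: tabs and line breaks become single spaces, and all other control characters are dropped. The function name `CleanExtractedText` is hypothetical.

    ```vba
    ' Hypothetical helper: strips control characters from extracted text.
    Function CleanExtractedText(ByVal strText As String) As String
        Dim i As Long
        Dim code As Long
        Dim ch As String
        Dim strOut As String

        For i = 1 To Len(strText)
            ch = Mid$(strText, i, 1)
            code = AscW(ch)
            Select Case code
                Case 9, 10, 13
                    ' Treat tab, line feed and carriage return as a space
                    strOut = strOut & " "
                Case 0 To 31, 127
                    ' Drop every other control character
                Case Else
                    strOut = strOut & ch
            End Select
        Next i

        ' Collapse runs of spaces left behind by the replacements
        Do While InStr(strOut, "  ") > 0
            strOut = Replace(strOut, "  ", " ")
        Loop

        CleanExtractedText = Trim$(strOut)
    End Function
    ```

    Note that this flattens line breaks, so if you need one row per line, split the text into lines first and clean each line separately.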

    Hope that helps!
