Excel VBA Web Scraping Table Elements from a <frameset> and a <frame>

Posted on 2021-02-20 03:36:15

Question


I am trying to scrape some table-looking items from a website into Excel.

I'm no stranger to coding in general, though I'm pretty new to VBA in an Excel sense :)

I have tried Excel's Data > From Web interface, but it doesn't recognize the table. I'm guessing that's because the page is built using <frameset> and <frame> elements (or at least that's what my Google-Fu has led me to understand).

Snippet of the markup for the second frame, which holds the table:

<html>

<frame title="links" ...>...</frame>

<frame title="queue">
#document

<head>...</head>
<body>
<div id="container">
<script>...</script>
<div>

<table id="oTable">

<colgroup>...</colgroup>

<thead>...</thead>


<tbody>
  <tr onclick="changeHighlight( 'eid0' )" id="eid0" class="queryshaded">
    <td nowrap=""><a onclick="javascript:window.open('IWViewer.jsp?id=3.5599976.5599976');" title="Open Image" href="javascript:doNothing();"><img title="Open Image" border="0" alt="Open Image" src="URL.gif"></a>&nbsp;<a onclick="javascript:window.open('URL','_newtab');" title="Open Workitem" href="javascript:doNothing();"><img title="Open Workitem" border="0" alt="Open Workitem" src="URL.gif"></a>
    </td><td scope="row" nowrap=""><a href="URL" target="_Blank">12345</a></td>
    <td nowrap=""><a href="URL" target="_Blank">28/08/2018 17:00:49</a></td>
    <td nowrap=""><a href="URL" target="_Blank">11/09/2018 16:28:39</a></td>
    <td nowrap=""><a href="URL" target="_Blank">5,599,976</a></td>
    <td nowrap=""><a href="URL" target="_Blank">dijm</a></td></tr>
  <tr onclick="changeHighlight( 'eid1' )" id="eid1" class="queryunshaded">
    <td nowrap=""><a onclick="javascript:window.open('IWViewer.jsp?id=3.6443276.6443276');" title="Open Image" href="javascript:doNothing();"><img title="Open Image" border="0" alt="Open Image" src="URL.gif"></a>&nbsp;<a onclick="javascript:window.open('URL;id=3.6443276.6443276','_newtab');" title="Open Workitem" href="javascript:doNothing();"><img title="Open Workitem" border="0" alt="Open Workitem" src="URL.gif"></a>
    </td><td scope="row" nowrap=""><a href="URL" target="_Blank">67890</a></td>
    <td nowrap=""><a href="URL" target="_Blank">25/06/2019 11:01:01</a></td>
    <td nowrap=""><a href="URL" target="_Blank">09/07/2019 10:32:32</a></td>
    <td nowrap=""><a href="URL" target="_Blank">6,443,276</a></td>
    <td nowrap=""><a href="URL" target="_Blank"></a></td></tr>
  <tr onclick="changeHighlight( 'eid2' )" id="eid2" class="queryshaded">
    <td nowrap=""><a onclick="javascript:window.open('IWViewer.jsp?id=3.6443287.6443287');" title="Open Image" href="javascript:doNothing();"><img title="Open Image" border="0" alt="Open Image" src="URL.gif"></a>&nbsp;<a onclick="javascript:window.open('URL;id=3.6443287.6443287','_newtab');" title="Open Workitem" href="javascript:doNothing();"><img title="Open Workitem" border="0" alt="Open Workitem" src="URL.gif"></a>
    </td><td scope="row" nowrap=""><a href="URL" target="_Blank">23456</a></td>
    <td nowrap=""><a href="URL" target="_Blank">25/06/2019 11:01:24</a></td>
    <td nowrap=""><a href="URL" target="_Blank">09/07/2019 10:35:30</a></td>
    <td nowrap=""><a href="URL" target="_Blank">6,443,287</a></td>
    <td nowrap=""><a href="URL" target="_Blank"></a></td></tr>
  <tr onclick="changeHighlight( 'eid3' )" id="eid3" class="queryunshaded">
    <td nowrap=""><a onclick="javascript:window.open('IWViewer.jsp?id=3.6443339.6443339');" title="Open Image" href="javascript:doNothing();"><img title="Open Image" border="0" alt="Open Image" src="URL.gif"></a>&nbsp;<a onclick="javascript:window.open('URL;id=3.6443339.6443339','_newtab');" title="Open Workitem" href="javascript:doNothing();"><img title="Open Workitem" border="0" alt="Open Workitem" src="URL.gif"></a>
    </td><td scope="row" nowrap=""><a href="URL" target="_Blank">78901</a></td>
    <td nowrap=""><a href="URL" target="_Blank">25/06/2019 11:06:02</a></td>
    <td nowrap=""><a href="URL" target="_Blank">09/07/2019 10:40:39</a></td>
    <td nowrap=""><a href="URL" target="_Blank">6,443,339</a></td>
    <td nowrap=""><a href="URL" target="_Blank"></a></td></tr>
  <tr onclick="changeHighlight( 'eid4' )" id="eid4" class="queryshaded">
    <td nowrap=""><a onclick="javascript:window.open('IWViewer.jsp?id=3.6443344.6443344');" title="Open Image" href="javascript:doNothing();"><img title="Open Image" border="0" alt="Open Image" src="URL.gif"></a>&nbsp;<a onclick="javascript:window.open('URL;id=3.6443344.6443344','_newtab');" title="Open Workitem" href="javascript:doNothing();"><img title="Open Workitem" border="0" alt="Open Workitem" src="URL.gif"></a>
    </td><td scope="row" nowrap=""><a href="URL" target="_Blank">34567</a></td>
    <td nowrap=""><a href="URL" target="_Blank">25/06/2019 11:06:17</a></td>
    <td nowrap=""><a href="URL" target="_Blank">09/07/2019 10:40:43</a></td>
    <td nowrap=""><a href="URL" target="_Blank">6,443,344</a></td>
    <td nowrap=""><a href="URL" target="_Blank"></a></td></tr>

I have tried various solutions that look somewhat like this: https://www.ozgrid.com/forum/forum/other-software-applications/excel-and-web-browsers-help/131683-extracting-data-from-a-grid-on-webpage and Scraping data from website using vba

I have also tried referencing the frames directly to get the info from there (again: new to Excel VBA):

    'set myHTMLDoc to the main pages IE document
    Dim myHTMLDoc As HTMLDocument
    Set myHTMLDoc = ie.Document

    'set myHTMLFrame2 as the 2nd frame of the main page (index starts at 0)
    Dim myHTMLFrame2 As HTMLDocument
    Set myHTMLFrame2 = myHTMLDoc.Frames(1).Document

With the above block of code I'm getting a "Run-time error '438'". Without the above block I'm getting a "Run-time error '1004'".

The info I eventually want is in each row:

    </td><td scope="row" nowrap=""><a href="URL" target="_Blank">67890</a></td>
    <td nowrap=""><a href="URL" target="_Blank">25/06/2019 11:01:01</a></td>
    <td nowrap=""><a href="URL" target="_Blank">09/07/2019 10:32:32</a></td>
    <td nowrap=""><a href="URL" target="_Blank">6,443,276</a></td>

Ideally I'd like to dump each element into a cell

67890 | 25/06/2019 11:01:01 | 09/07/2019 10:32:32 | 6,443,276

There are 20 of these rows on each page (there's a button to press to get to the next page, which I'll figure out later... hopefully haha).

Massive preemptive thank you to anyone who can help :)

-EDIT- This is the code that I'm currently working with (not precious about it :P )

Private Sub CommandButton1_Click()


    Dim ie     As Object
    Dim html   As Object
    Dim objElementTR As Object
    Dim objTR  As Object
    Dim objElementsTD As Object
    Dim objTD  As Object
    Dim result As String
    Dim intRow As Long
    Dim intCol As Long

    Set ie = CreateObject("InternetExplorer.Application")
    ie.Navigate "URL"
    ie.Visible = True
    ' loop until page is loaded
    Do Until (ie.ReadyState = 4 And Not ie.Busy)
        DoEvents
    Loop

    'set myHTMLDoc to the main pages IE document
    Dim myHTMLDoc As HTMLDocument
    Set myHTMLDoc = ie.Document

    'set myHTMLFrame2 as the 2nd frame of the main page (index starts at 0)
    Dim myHTMLFrame2 As HTMLDocument
    Set myHTMLFrame2 = ie.Document.querySelector("[title=queue]").contentDocument.getElementById("oTable")

    result = myHTMLFrame2
    Set html = CreateObject("htmlfile")
    myHTMLFrame2 = result
    Set objElementTR = html.getElementsByTagName("tr")
    ReDim myarray(0 To objElementTR.Length, 0 To 10)
    For Each objTR In objElementTR
        intRow = intRow + 1
        Set objElementsTD = objTR.getElementsByTagName("td")
        For Each objTD In objElementsTD
            myarray(intRow, intCol) = objTD.innerText
            intCol = intCol + 1
        Next objTD
        intCol = 0
    Next objTR
    With Sheets(1).Cells(1, 1).Cells(Rows.Count, "A").End(xlUp).Offset(1, 0)
        .Resize(UBound(myarray), UBound(myarray, 2)).Value = myarray
    End With



End Sub

Answer 1:


You could try isolating the frame by its title attribute, then go via contentDocument and get the table by its id:

ie.document.querySelector("[title=queue]").contentDocument.querySelector("#oTable")

The trailing .querySelector("#oTable") can be interchanged with .getElementById("oTable").
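As a rough sketch of that lookup, the helper below wraps the selector chain in a function. The name GetQueueTable is hypothetical, and it assumes ie is an InternetExplorer.Application object whose page has finished loading and that the frame is served from the same origin as the parent page (otherwise contentDocument will likely be blocked). The result is held in an Object variable because the returned element is a table, not an HTMLDocument.

```vb
' Sketch only: GetQueueTable is a hypothetical helper name.
' Assumes ie is an InternetExplorer.Application object that has
' already finished loading the frameset page.
Public Function GetQueueTable(ByVal ie As Object) As Object
    Dim queueFrame As Object

    ' Locate the <frame title="queue"> element in the top-level document.
    Set queueFrame = ie.document.querySelector("[title=queue]")

    ' If no such frame exists, leave the return value as Nothing.
    If queueFrame Is Nothing Then Exit Function

    ' contentDocument is the document loaded inside that frame;
    ' getElementById returns the table element (or Nothing if absent).
    Set GetQueueTable = queueFrame.contentDocument.getElementById("oTable")
End Function
```

The caller can then test the result with Is Nothing before using it, which avoids a run-time error when the frame or the table has not been found.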

I would then copy the table's .outerHTML to the clipboard and paste it directly into the sheet, letting Excel parse the table into cells.
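One way the clipboard step might look is sketched below. The procedure name PasteTableToSheet and its parameters are hypothetical. It assumes hTable is the table element already retrieved from the frame, and it creates a late-bound MSForms.DataObject by its CLSID so that no reference to the Microsoft Forms 2.0 Object Library is needed.

```vb
' Sketch only: PasteTableToSheet is a hypothetical helper name.
' Assumes hTable is the <table id="oTable"> element from the frame
' and target is the top-left cell where the table should land.
Public Sub PasteTableToSheet(ByVal hTable As Object, ByVal target As Range)
    Dim clipboard As Object

    ' Late-bound MSForms.DataObject, created via its CLSID.
    Set clipboard = GetObject("New:{1C3B4210-F441-11CE-B9EA-00AA006B1A69}")

    ' Put the table markup on the clipboard as text.
    clipboard.SetText hTable.outerHTML
    clipboard.PutInClipboard

    ' Paste at the target cell; Excel interprets the HTML table
    ' and spreads it over rows and columns.
    target.PasteSpecial
End Sub
```

It could be called as PasteTableToSheet ie.document.querySelector("[title=queue]").contentDocument.getElementById("oTable"), ThisWorkbook.Worksheets(1).Range("A1"). Note that this pastes every column, including the leading icon column, so any unwanted columns would need to be deleted afterwards.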



Source: https://stackoverflow.com/questions/56861132/excel-vba-web-scraping-table-elements-from-a-frameset-and-a-frame
