Web scraping without specified name, id, or class attached to the data

一笑奈何 submitted on 2021-01-29 17:27:24

Question


I am trying to track the status of shipping delivery and display it on an Excel tab.

This website, https://webcsw.ocs.co.jp/csw/ECSWG0201R00003P.do, displays data when the "Air WayBill No." is entered.
I managed to open Internet Explorer, enter the Air WayBill number, then click the search button.

Dim IE As Object, doc As Object
Set IE = CreateObject("InternetExplorer.Application")
IE.Navigate "https://webcsw.ocs.co.jp/csw/ECSWG0201R00000P.do"
IE.Visible = True

' Wait until the page has finished loading
While IE.Busy Or IE.readyState <> 4
    DoEvents
Wend

Set doc = IE.document
With doc
    ' Enter the Air WayBill number from the sheet, then click the search button
    .getElementsByName("edtAirWayBillNo")(0).Value = ThisWorkbook.Sheets("Sheet3").Range("B2").Value
    .getElementsByClassName("button btn_ex")(0).Click
End With

I couldn't find any identifying attributes such as name, id or class on the result data.

How do I retrieve data from the chart section where they are just marked by 'tbody', 'tr' and 'td'?

I tried to use the .getElementsByTagName method.

This is the section of the website's HTML from which I need to retrieve data:

<table border="0" cellpadding="0" cellspacing="0" id="" style="border:#d0d0d0 1px dotted;" width="100%">
            <tbody id="chart_header">
                <tr>
                    <td rowspan="1" colspan="1" width="90px">Air WayBill No.</td>
                    <td rowspan="1" colspan="3" width="370px">Latest Tracking Record</td>
                    <td rowspan="1" colspan="1" width="150px">Shipper</td>
                    <td rowspan="1" colspan="1" width="150px">Receiver</td>
                    <td rowspan="1" colspan="1" width="40px">Pcs</td>
                    <td rowspan="1" colspan="1" width="80px">Actual Weight</td>
                    <td rowspan="1" colspan="1" width="70px">Vol. Weight</td>
                </tr>
            </tbody>

            <tbody id="chart" style="height: auto">
            <!-- record start -->
            
            
            
                 <tr>
                     <td>
                         <a href="#0" shape="rect">
                             25017894414
                         </a>
                     </td>
                    <td width="160px">
                         <div style=" position:relative; width:100%;align:left;vertical-align: 
                                      middle;">&nbsp;
                          <div style="position:absolute;top:0pt;left: 1pt; margin: 1px;">
                              Fri
                          </div>
                          <div style="position:absolute;top:0pt;left:25pt;">
                              04Sep2020
                          </div>
                          <div style="position:absolute;top:0pt;left:80pt;">
                              09:40
                          </div>
                         </div>
                     </td>
                     <td width="90px">
                         <input type="text" value="Product Scanned" style="width:90px;" tabindex="-1" class="readonly_left" readonly="readonly">
                     </td>
                     <td width="130px" style="border-width:1px 1px 1px 0px;">
                         
                             <img src="./image/tpStatus_BLUE4.gif" width="130px" height="16px" class="middle">
                         
                     </td>
                     <td>
                         <input type="text" value="SUZHOU/CHINA" style="width:145px;" tabindex="-1" class="readonly_left" readonly="readonly">
                     </td>
                     <td>
                        <input type="text" value="AICHI KEN/JAPAN" style="width:145px;" tabindex="-1" class="readonly_left" readonly="readonly">
                     </td>
                     <td class="t_right">
                         <input type="text" value="1" style="width:40px;" tabindex="-1" class="readonly_right" readonly="readonly">
                     </td>
                     <td class="t_right">
                         <input type="text" value="1.9kg" style="width:70px;" tabindex="-1" class="readonly_right" readonly="readonly">
                     </td>
                     <td class="t_right">
                         <input type="text" value="1.2kg" style="width:70px;" tabindex="-1" class="readonly_right" readonly="readonly">
                     </td>
                 </tr>
            
            
            <!-- record end -->
            </tbody>
        </table>

Answer 1:


Provided you wait for the results to load, you should be able to use ie.document.querySelector("#charttitle + table") to grab the table, then use the clipboard to copy that node's outerHTML and paste it into Excel as a table. To wait, either loop with a time-out until the table has results (preferable) or use an explicit wait.

This

#charttitle + table

is a CSS selector that matches the table that is the adjacent sibling of the element with id charttitle.
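The time-out loop mentioned above could look roughly like the following. This is only a sketch: the function name WaitForTable and the 10-second default are illustrative assumptions, and it only checks that the selector matches something, not that the table already contains rows.

```vba
' Hypothetical helper: polls for the results table until it appears or the time-out expires.
' Returns Nothing if the table was not found in time.
Function WaitForTable(ByVal ie As Object, Optional ByVal timeoutSeconds As Double = 10) As Object
    Dim startTime As Double
    Dim found As Object

    startTime = Timer   ' seconds since midnight; a run that spans midnight is not handled here
    Do
        DoEvents
        ' The document may be unavailable while the page is still loading, so ignore errors
        On Error Resume Next
        Set found = ie.document.querySelector("#charttitle + table")
        On Error GoTo 0
        If Not found Is Nothing Then Exit Do
    Loop While Timer - startTime < timeoutSeconds

    Set WaitForTable = found
End Function
```

A caller would test the result with "If Not WaitForTable(ie) Is Nothing Then" before touching the clipboard.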

'wait condition after click to submit 
Dim clipboard As Object

Set clipboard = GetObject("New:{1C3B4210-F441-11CE-B9EA-00AA006B1A69}")

clipboard.SetText ie.document.querySelector("#charttitle + table").outerHTML
clipboard.PutInClipboard
ActiveSheet.Cells(1, 1).PasteSpecial
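If going through the clipboard is not wanted, the values could also be read directly. In the HTML shown in the question, most cells keep their text in the value attribute of an input element inside the tbody with id chart, so a sketch along these lines may work. The procedure name ReadChartInputs is an assumption, and it presumes there is only one element with id chart on the page.

```vba
' Hypothetical helper: writes the value of every <input> inside tbody#chart
' into consecutive cells to the right of the given target cell.
Sub ReadChartInputs(ByVal doc As Object, ByVal target As Range)
    Dim chartBody As Object
    Dim inputs As Object
    Dim i As Long

    Set chartBody = doc.getElementById("chart")
    If chartBody Is Nothing Then Exit Sub   ' results not loaded, or the page layout differs

    Set inputs = chartBody.getElementsByTagName("input")
    For i = 0 To inputs.Length - 1
        target.Offset(0, i).Value = inputs.Item(i).Value
    Next i
End Sub
```

It could be called as ReadChartInputs ie.document, ActiveSheet.Range("A1"). Cells without an input, such as the Air WayBill link and the date, would need to be read separately, for example through innerText.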

You can get all of those tables with querySelectorAll and the CSS general sibling combinator (~):

Dim tables As Object, i As Long

Set tables = ie.document.querySelectorAll("#charttitle ~ table")

You then loop with For i = 0 To tables.Length - 1, access the current table with tables.item(i).outerHTML, and write each one out starting at the next free output row (determined, for example, by finding the last used row).
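Put together, that loop might be sketched as follows. The procedure name PasteAllTables is an assumption, as is the choice of column A for finding the last used row; it reuses the same clipboard object and selector as the snippets above.

```vba
' Hypothetical helper: pastes every table that follows #charttitle, one below the other.
Sub PasteAllTables(ByVal ie As Object, ByVal ws As Worksheet)
    Dim tables As Object
    Dim clipboard As Object
    Dim i As Long
    Dim nextRow As Long

    Set clipboard = GetObject("New:{1C3B4210-F441-11CE-B9EA-00AA006B1A69}")
    Set tables = ie.document.querySelectorAll("#charttitle ~ table")

    For i = 0 To tables.Length - 1
        ' Next free row, judged by column A (assumes column A is filled for every pasted row)
        nextRow = ws.Cells(ws.Rows.Count, 1).End(xlUp).Row + 1

        clipboard.SetText tables.Item(i).outerHTML
        clipboard.PutInClipboard
        ws.Cells(nextRow, 1).PasteSpecial
    Next i
End Sub
```

On an empty sheet the first table lands in row 2, because the last-row lookup returns row 1 when column A has no data.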

Read about CSS selectors here:

https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_Selectors

And about finding the last row here:

https://www.rondebruin.nl/win/s9/win005.htm

Remember to check if scraping is allowed under the terms of service.



Source: https://stackoverflow.com/questions/63758241/web-scraping-without-specified-name-id-or-class-attached-to-the-data
