Parse HTML content in VBA

匿名 (未验证) 提交于 2019-12-03 02:11:02

问题:

I have a question relating to HTML parsing. I have a website with some products and I would like to catch text within page into my current spreadsheet. This spreadsheet is quite big but contains ItemNbr in 3rd column, I expect the text in the 14th column and one row corresponds to one product (item).

My idea is to fetch the 'Material' on the webpage which is inside the Innertext after tag. The id number changes from one page to page (sometimes ).

Here is the structure of the website:

<div style="position:relative;">     <div></div>     <table id="list-table" width="100%" tabindex="1" cellspacing="0" cellpadding="0" border="0" role="grid" aria-multiselectable="false" aria-labelledby="gbox_list-table" class="ui-jqgrid-btable" style="width: 930px;">         <tbody>             <tr class="jqgfirstrow" role="row" style="height:auto">                 <td ...</td>                 <td ...</td>             </tr>             <tr role="row" id="1" tabindex="-1" class="ui-widget-content jqgrow ui-row-ltr">                 <td ...</td>                 <td ...</td>             </tr>             <tr role="row" id="2" tabindex="-1" class="ui-widget-content jqgrow ui-row-ltr">                 <td ...</td>                 <td ...</td>             </tr>             <tr role="row" id="3" tabindex="-1" class="ui-widget-content jqgrow ui-row-ltr">                 <td ...</td>                 <td ...</td>             </tr>             <tr role="row" id="4" tabindex="-1" class="ui-widget-content jqgrow ui-row-ltr">                 <td ...</td>                 <td ...</td>             </tr>             <tr role="row" id="5" tabindex="-1" class="ui-widget-content jqgrow ui-row-ltr">                 <td ...</td>                 <td ...</td>             </tr>             <tr role="row" id="6" tabindex="-1" class="ui-widget-content jqgrow ui-row-ltr">                 <td ...</td>                 <td ...</td>             </tr>             <tr role="row" id="7" tabindex="-1" class="ui-widget-content jqgrow ui-row-ltr">                 <td role="gridcell" style="padding-left:10px" title="Material" aria-describedby="list-table_">Material</td>                 <td role="gridcell" style="" title="600D polyester." aria-describedby="list-table_">600D polyester.</td>             </tr>                        <tr ...>             </tr>         </tbody>     </table> </div> 

I would like to get "600D Polyester" as a result.

My (not working) code snippet is as is:

Sub ParseMaterial()      Dim Cell As Integer     Dim ItemNbr As String      Dim AElement As Object     Dim AElements As IHTMLElementCollection Dim IE As MSXML2.XMLHTTP60 Set IE = New MSXML2.XMLHTTP60  Dim HTMLDoc As MSHTML.HTMLDocument Dim HTMLBody As MSHTML.HTMLBody  Set HTMLDoc = New MSHTML.HTMLDocument Set HTMLBody = HTMLDoc.body  For Cell = 1 To 5                            'I iterate through the file row by row      ItemNbr = Cells(Cell, 3).Value           'ItemNbr isin the 3rd Column of my spreadsheet      IE.Open "GET", "http://www.example.com/?item=" & ItemNbr, False     IE.send      While IE.ReadyState <> 4         DoEvents     Wend      HTMLBody.innerHTML = IE.responseText      Set AElements = HTMLDoc.getElementById("list-table").getElementsByTagName("tr")     For Each AElement In AElements         If AElement.Title = "Material" Then             Cells(Cell, 14) = AElement.nextNode.value     'I write the material in the 14th column         End If     Next AElement          Application.Wait (Now + TimeValue("0:00:2"))  Next Cell 

Thanks for your help !

回答1:

Just a couple things that hopefully will get you in the right direction:

  • clean up a bit: remove the readystate property testing loop. The value returned by the readystate property will never change in this context - code will pause after the send instruction, to resume only once the server response is received, or has failed to do so. The readystate property will be set accordingly, and the code will resume execution. You should still test for the ready state, but the loop is just unnecessary

  • target the right HTML elements: you are searching through the tr elements - while the logic of how you use these elements in your code actually looks to point to td elements

  • make sure the properties are actually available for the objects you are using them on: to help you with this, try and declare all your variable as specific objects instead of the generic Object. This will activate intellisense. If you have a difficult time finding the actual name of your object as defined in the relevant library in a first place, declare it as the generic Object, run your code, and then inspect the type of the object - by printing typename(your_object) to the debug window for instance. This should put you on your way

I have also included some code below that may help. If you still can't get this to work and you can share your urls - plz do that.

Sub getInfoWeb()      Dim cell As Integer     Dim xhr As MSXML2.XMLHTTP60     Dim doc As MSHTML.HTMLDocument     Dim table As MSHTML.HTMLTable     Dim tableCells As MSHTML.IHTMLElementCollection      Set xhr = New MSXML2.XMLHTTP60      For cell = 1 To 5          ItemNbr = Cells(cell, 3).Value          With xhr              .Open "GET", "http://www.example.com/?item=" & ItemNbr, False             .send              If .readyState = 4 And .Status = 200 Then                 Set doc = New MSHTML.HTMLDocument                 doc.body.innerHTML = .responseText             Else                 MsgBox "Error" & vbNewLine & "Ready state: " & .readyState & _                 vbNewLine & "HTTP request status: " & .Status             End If          End With          Set table = doc.getElementById("list-table")         Set tableCells = table.getElementsByTagName("td")          For Each tableCell In tableCells             If tableCell.getAttribute("title") = "Material" Then                 Cells(cell, 14).Value = tableCell.NextSibling.innerHTML             End If         Next tableCell      Next cell  End Sub 

EDIT: as a follow-up to the further information you provided in the comment below - and the additionnal comments I have added

'Determine your product number     'Open an xhr for your source url, and retrieve the product number from there - search for the tag which     'text include the "productnummer:" substring, and extract the product number from the outerstring     'OR     'if the product number consistently consists of the fctkeywords you are entering in your source url     'with two "0" appended - just build the product number like that 'Open an new xhr for this url "http://www.pfconcept.com/cgi-bin/wspd_pcdb_cgi.sh/y/y2productspec-ajax.p?itemc=" & product_number & "&_search=false&rows=-1&page=1&sidx=&sord=asc" 'Load the response in an XML document, and retrieve the material information  Sub getInfoWeb()      Dim xhr As MSXML2.XMLHTTP60     Dim doc As MSXML2.DOMDocument60     Dim xmlCell As MSXML2.IXMLDOMElement     Dim xmlCells As MSXML2.IXMLDOMNodeList     Dim materialValueElement As MSXML2.IXMLDOMElement      Set xhr = New MSXML2.XMLHTTP60          With xhr              .Open "GET", "http://www.pfconcept.com/cgi-bin/wspd_pcdb_cgi.sh/y/y2productspec-ajax.p?itemc=10031700&_search=false&rows=-1&page=1&sidx=&sord=asc", False             .send              If .readyState = 4 And .Status = 200 Then                 Set doc = New MSXML2.DOMDocument60                 doc.LoadXML .responseText             Else                 MsgBox "Error" & vbNewLine & "Ready state: " & .readyState & _                 vbNewLine & "HTTP request status: " & .Status             End If          End With          Set xmlCells = doc.getElementsByTagName("cell")          For Each xmlCell In xmlCells             If xmlCell.Text = "Materiaal" Then                 Set materialValueElement = xmlCell.NextSibling             End If         Next          MsgBox materialValueElement.Text  End Sub 

EDIT2: an alternative automating IE

Sub searchWebViaIE()     Dim ie As SHDocVw.InternetExplorer     Dim doc As MSHTML.HTMLDocument     Dim anchors As MSHTML.IHTMLElementCollection     Dim anchor As MSHTML.HTMLAnchorElement     Dim prodSpec As MSHTML.HTMLAnchorElement     Dim tableCells As MSHTML.IHTMLElementCollection     Dim materialValueElement As MSHTML.HTMLTableCell     Dim tableCell As MSHTML.HTMLTableCell      Set ie = New SHDocVw.InternetExplorer      With ie         .navigate "http://www.pfconcept.com/cgi-bin/wspd_pcdb_cgi.sh/y/y2facetmain.p?fctkeywords=100317&world=general#tabs-4"         .Visible = True          Do While .readyState <> READYSTATE_COMPLETE Or .Busy = True             DoEvents         Loop          Set doc = .document          Set anchors = doc.getElementsByTagName("a")          For Each anchor In anchors             If InStr(anchor.innerHTML, "Product Specificatie") <> 0 Then                 anchor.Click                 Exit For             End If         Next anchor          Do While .readyState <> READYSTATE_COMPLETE Or .Busy = True             DoEvents         Loop      End With      For Each anchor In anchors         If InStr(anchor.innerHTML, "Product Specificatie") <> 0 Then             Set prodSpec = anchor         End If     Next anchor      Set tableCells = doc.getElementById("list-table").getElementsByTagName("td")      If Not tableCells Is Nothing Then         For Each tableCell In tableCells             If tableCell.innerHTML = "Materiaal" Then                 Set materialValueElement = tableCell.NextSibling             End If         Next tableCell     End If      MsgBox materialValueElement.innerHTML  End Sub 


标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!