pull data from website using VBA excel multiple classname

梦谈多话 2020-12-15 13:40

I know this has been asked many times, but I haven't seen a clear answer for looping through a div and finding tags with the same class name.

My first question:

3 Answers
  • 2020-12-15 14:21

    CSS selector:

    You could also use a CSS selector of #images img[src^='img/'].

    This selects img elements, inside the element with id images, whose src attribute value starts with 'img/'.

    The # is for id; [] is for an attribute; ^= means "starts with"; the space in #images img is the descendant combinator, i.e. img within images.



    As more than one element will be matched, you would use the .querySelectorAll method of document and then loop over the length of the returned nodeList.

    VBA Code:

    Option Explicit

    ' Requires a reference to the Microsoft HTML Object Library (Tools > References).
    Public Sub test()
        Dim html As HTMLDocument
        Set html = New HTMLDocument

        ' Fetch the page synchronously and load the response into the document
        With CreateObject("WINHTTP.WinHTTPRequest.5.1")
            .Open "GET", "http://www.someurl.com", False
            .send
            html.body.innerHTML = .responseText
        End With

        Dim aNodeList As Object, iItem As Long
        Set aNodeList = html.querySelectorAll("#images img[src^='img/']")
        With ActiveSheet
            For iItem = 0 To aNodeList.Length - 1
                ' img elements have no inner text, so write the src attribute instead
                .Cells(iItem + 1, 1) = aNodeList.item(iItem).getAttribute("src")
                '.Cells(iItem + 1, 1) = aNodeList(iItem).getAttribute("src") '<== or potentially this syntax
            Next iItem
        End With
    End Sub
    
  • 2020-12-15 14:36

    Your first option is usually preferable since it is much faster than the second: it sends a request directly to the web server and returns the response. This is much more efficient than automating Internet Explorer (the second option). Automating IE is very slow, since you are effectively just browsing the site: it must load all the resources in the page (images, scripts, CSS files etc.) and will also run any JavaScript on the page. None of this is usually useful, and you have to wait for it to finish before parsing the page.

    This, however, is a bit of a double-edged sword. Whilst much slower, automating Internet Explorer is substantially easier than the first method if you are not familiar with HTTP requests, especially when elements are generated dynamically or the page relies on AJAX. It is also easier to automate IE when you need to access data on a site that requires you to log in, since it will handle the relevant cookies for you. This is not to say that web scraping cannot be done with the first method, rather that it requires a deeper understanding of web technologies and the architecture of the site.

    An improvement on the first method would be to use a different object to handle the request and response: the WinHTTP library offers more resilience than the MSXML library and will generally handle any cookies automatically as well.

    As for parsing the data, in your first approach you have used late binding to create the HTML object (htmlfile). Whilst this removes the need for a reference, it also reduces functionality. For example, when using late binding you miss out on the features added if the user has IE9 installed, specifically in this case the getElementsByClassName function.

    As such a third option (and my preferred method):

    Dim oHtml       As HTMLDocument
    Dim oElement    As Object
    
    Set oHtml = New HTMLDocument
    
    
    With CreateObject("WINHTTP.WinHTTPRequest.5.1")
        .Open "GET", "http://www.someurl.com", False
        .send
        oHtml.body.innerHTML = .responseText
    End With
    
    For Each oElement In oHtml.getElementsByClassName("imageElement")
        Debug.Print oElement.Children(0).src
    Next oElement
    
    'IE 8 alternative
    'For Each oElement In oHtml.getElementsByTagName("div")
    '    If oElement.className = "imageElement" Then
    '        Debug.Print oElement.Children(0).src
    '    End If
    'Next oElement
    

    This requires a reference to the Microsoft HTML Object Library. It will fail if the user does not have IE9 installed, but this can be handled and is becoming less and less relevant.
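
    One way the missing-method case could be handled at run time is sketched below. This is only an illustration, not part of the answer's code: the helper name GetByClass is hypothetical, the document is passed As Object so that a missing getElementsByClassName surfaces as a trappable run-time error rather than a compile error (an assumption about how older MSHTML versions behave), and the fallback matches only elements whose class attribute is exactly the given name.

    ```vba
    Option Explicit

    ' Hypothetical helper: collects the elements of doc that carry the given class.
    ' doc is assumed to be an MSHTML document that has already been loaded.
    Public Function GetByClass(ByVal doc As Object, ByVal className As String) As Collection
        Dim result As New Collection
        Dim nodes As Object
        Dim el As Object

        ' Try the IE9+ method first; trap the error if it is not available
        On Error Resume Next
        Set nodes = doc.getElementsByClassName(className)
        If Err.Number <> 0 Then Set nodes = Nothing
        On Error GoTo 0

        If Not nodes Is Nothing Then
            For Each el In nodes
                result.Add el
            Next el
        Else
            ' Fallback: scan every element and compare its class attribute
            For Each el In doc.getElementsByTagName("*")
                If el.className = className Then result.Add el
            Next el
        End If

        Set GetByClass = result
    End Function
    ```

    The loop above could then be written as For Each oElement In GetByClass(oHtml, "imageElement").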

  • 2020-12-15 14:43

    To print elements to cells replace:

    For Each oElement In oHtml.getElementsByClassName("imageElement")
        Debug.Print oElement.Children(0).src
    Next oElement
    

    With:

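    Dim wsTarget As Worksheet
    Dim i As Long    ' Long rather than Integer, so large row counts cannot overflow
    i = 1
    Set wsTarget = ActiveWorkbook.Worksheets("SomeSheet")

    ' Write each image source to column A, one row per element
    For Each oElement In oHtml.getElementsByClassName("imageElement")
        wsTarget.Range("A" & i) = oElement.Children(0).src
        i = i + 1
    Next oElement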
    

    'Corrected the syntax error on For
