I know this has been asked many times, but I haven't seen a clear answer for looping through a div and finding tags with the same class name.
My first question:
CSS selector:
You could also use a CSS selector of #images img[src^='img/']. This selects img elements, inside the element with an id of images, whose src attribute has a value starting with 'img/'. The # is for id; [] is for attribute; ^ is for starts with; and the space in #images img is the descendant combinator, meaning img within images.
CSS query:
As more than one element will be matched, you would use the .querySelectorAll method of document and then loop over the length of the returned nodeList.
VBA Code:
Option Explicit

' Requires a reference to the Microsoft HTML Object Library.
Public Sub test()
    Dim html As HTMLDocument
    Set html = New HTMLDocument

    ' Fetch the page source directly, without automating a browser.
    With CreateObject("WINHTTP.WinHTTPRequest.5.1")
        .Open "GET", "http://www.someurl.com", False
        .send
        html.body.innerHTML = .responseText
    End With

    Dim aNodeList As Object, iItem As Long
    Set aNodeList = html.querySelectorAll("#images img[src^='img/']")

    With ActiveSheet
        For iItem = 0 To aNodeList.Length - 1
            ' img elements have no inner text, so write the src attribute instead.
            .Cells(iItem + 1, 1) = aNodeList.item(iItem).src
            '.Cells(iItem + 1, 1) = aNodeList(iItem).src '<== or potentially this syntax
        Next iItem
    End With
End Sub
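To try the selector without a network request, the same query can be run against a small inline HTML string. This is only a sketch: CountMatchingImages and DemoSelector are hypothetical names, the markup is made up for illustration, and it assumes the same early-bound HTMLDocument (reference to the Microsoft HTML Object Library) as the code above.

```vba
Option Explicit

' Hypothetical helper: counts the img elements inside #images whose src
' attribute starts with "img/". Assumes a reference to the
' Microsoft HTML Object Library, as in the answer above.
Public Function CountMatchingImages(ByVal markup As String) As Long
    Dim html As HTMLDocument
    Set html = New HTMLDocument
    html.body.innerHTML = markup
    CountMatchingImages = html.querySelectorAll("#images img[src^='img/']").Length
End Function

Public Sub DemoSelector()
    Dim sample As String
    ' Illustrative markup: two images match; one has a different src prefix
    ' and one sits outside the #images container.
    sample = "<div id='images'>" & _
             "<img src='img/a.png'><img src='other/b.png'><img src='img/c.png'>" & _
             "</div><img src='img/outside.png'>"
    Debug.Print CountMatchingImages(sample)
End Sub
```

Running DemoSelector prints the number of matches to the Immediate window, which makes it easy to adjust the selector before pointing it at a real page.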
Your first option is usually preferable since it is much faster than the second: it sends a request directly to the web server and returns the response. Automating Internet Explorer (the second option) is very slow, since you are effectively just browsing the site. It has to download every resource on the page (images, scripts, CSS files and so on) and run any JavaScript. None of this is usually useful, and you have to wait for it all to finish before parsing the page.
This is a bit of a double-edged sword, however. Whilst much slower, automating Internet Explorer is substantially easier than the first method if you are not familiar with HTTP requests, especially when elements are generated dynamically or the page relies on AJAX. It is also easier to automate IE when you need to access data on a site that requires you to log in, since it will handle the relevant cookies for you. This is not to say that web scraping cannot be done with the first method, rather that it requires a deeper understanding of web technologies and the architecture of the site.
A better alternative to the first method would be to use a different object to handle the request and response: the WinHTTP library offers more resilience than the MSXML library and will generally handle any cookies automatically as well.
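As a rough sketch of what a WinHTTP-based request might look like with some basic error handling, the function below checks the HTTP status before returning the body. FetchHtml and IsSuccessStatus are hypothetical names, and the timeout values and error number are arbitrary assumptions; SetTimeouts, Status and StatusText are members of the WinHttpRequest object.

```vba
Option Explicit

' Hypothetical helper: treats any 2xx status code as success.
Public Function IsSuccessStatus(ByVal statusCode As Long) As Boolean
    IsSuccessStatus = (statusCode >= 200 And statusCode < 300)
End Function

' Hypothetical helper: returns the response body for url, or raises an
' error if the server does not answer with a 2xx status.
Public Function FetchHtml(ByVal url As String) As String
    With CreateObject("WinHttp.WinHttpRequest.5.1")
        ' Resolve, connect, send and receive timeouts in milliseconds
        ' (illustrative values, not a recommendation).
        .SetTimeouts 5000, 5000, 10000, 10000
        .Open "GET", url, False
        .Send
        If Not IsSuccessStatus(.Status) Then
            Err.Raise vbObjectError + 513, "FetchHtml", _
                      "Request failed: " & .Status & " " & .StatusText
        End If
        FetchHtml = .ResponseText
    End With
End Function
```

The returned string can then be assigned to the innerHTML of an HTMLDocument body for parsing.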
As for parsing the data, in your first approach you have used late binding to create the HTML object (htmlfile). Whilst this removes the need for a reference, it also reduces functionality. For example, when using late binding you miss out on the features added if the user has IE9 installed, specifically in this case the getElementsByClassName method.
As such a third option (and my preferred method):
Dim oHtml As HTMLDocument
Dim oElement As Object

Set oHtml = New HTMLDocument

With CreateObject("WINHTTP.WinHTTPRequest.5.1")
    .Open "GET", "http://www.someurl.com", False
    .send
    oHtml.body.innerHTML = .responseText
End With

For Each oElement In oHtml.getElementsByClassName("imageElement")
    Debug.Print oElement.Children(0).src
Next oElement

'IE 8 alternative
'For Each oElement In oHtml.getElementsByTagName("div")
'    If oElement.className = "imageElement" Then
'        Debug.Print oElement.Children(0).src
'    End If
'Next oElement
This will require a reference to the Microsoft HTML Object Library. It will fail if the user does not have IE9 or later installed, but this can be handled and is becoming increasingly less relevant.
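One caveat with the commented-out IE 8 alternative is that it compares className to the whole string, so an element with class="imageElement large" would be skipped. A minimal sketch of a token-based check follows; HasClass and PrintImageSources are hypothetical names, and the check assumes class names are separated by single spaces (tabs or newlines in the attribute are not handled).

```vba
Option Explicit

' Hypothetical helper: True when target appears as a whole,
' space-separated token in classAttr.
Public Function HasClass(ByVal classAttr As String, ByVal target As String) As Boolean
    ' Pad both strings with spaces so only whole tokens can match.
    HasClass = InStr(1, " " & classAttr & " ", " " & target & " ", vbBinaryCompare) > 0
End Function

' Tag-based loop that does not rely on getElementsByClassName.
' Assumes, like the code above, that the first child of each matching div is the img.
Public Sub PrintImageSources(ByVal doc As HTMLDocument)
    Dim el As Object
    For Each el In doc.getElementsByTagName("div")
        If HasClass(el.className, "imageElement") Then
            Debug.Print el.Children(0).src
        End If
    Next el
End Sub
```

Calling PrintImageSources oHtml after the document has been loaded should print the same src values as the getElementsByClassName loop, as long as the elements of interest are div tags.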
To print elements to cells, replace:
For Each oElement In oHtml.getElementsByClassName("imageElement")
    Debug.Print oElement.Children(0).src
Next oElement
With:
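Dim wsTarget As Worksheet
Dim i As Long ' Long avoids overflow if there are more than 32,767 rows

i = 1
Set wsTarget = ActiveWorkbook.Worksheets("SomeSheet")

For Each oElement In oHtml.getElementsByClassName("imageElement")
    ' Write each image source to the next row in column A.
    wsTarget.Range("A" & i).Value = oElement.Children(0).src
    i = i + 1
Next oElement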