Question
I've written some VBA code to collect all the links leading to the next pages of a webpage. The pager's highest page number is 255. When I run my script, I do get all of those links, but they are spread across 6906 rows. That means the loop keeps running over the same links again and again and I end up writing duplicates. After filtering out the duplicates, I can see that 254 unique links remain. My goal is to iterate over the pages without hardcoding the highest page number into the link. Here is what I have tried:
Sub YifyLink()
    Const link = "https://www.yify-torrent.org/search/1080p/"
    Dim http As New XMLHTTP60, html As New HTMLDocument, htm As New HTMLDocument
    Dim x As Long, y As Long, item_link As String

    With http
        .Open "GET", link, False
        .send
        html.body.innerHTML = .responseText
    End With

    For Each post In html.getElementsByClassName("pager")(0).getElementsByTagName("a")
        If InStr(post.innerText, "Last") Then
            x = Split(Split(post.href, "-")(1), "/")(0)
        End If
    Next post

    For y = 0 To x
        item_link = link & "t-" & y & "/"
        With http
            .Open "GET", item_link, False
            .send
            htm.body.innerHTML = .responseText
        End With
        For Each posts In htm.getElementsByClassName("pager")(0).getElementsByTagName("a")
            I = I + 1: Cells(I, 1) = posts.href
        Next posts
    Next y
End Sub
The HTML element that contains the links:
<div class="pager"><a href="/search/1080p/" class="current">1</a> <a href="/search/1080p/t-2/">2</a> <a href="/search/1080p/t-3/">3</a> <a href="/search/1080p/t-4/">4</a> <a href="/search/1080p/t-5/">5</a> <a href="/search/1080p/t-6/">6</a> <a href="/search/1080p/t-7/">7</a> <a href="/search/1080p/t-8/">8</a> <a href="/search/1080p/t-9/">9</a> <a href="/search/1080p/t-10/">10</a> <a href="/search/1080p/t-11/">11</a> <a href="/search/1080p/t-12/">12</a> <a href="/search/1080p/t-13/">13</a> <a href="/search/1080p/t-14/">14</a> <a href="/search/1080p/t-15/">15</a> <a href="/search/1080p/t-16/">16</a> <a href="/search/1080p/t-17/">17</a> <a href="/search/1080p/t-18/">18</a> <a href="/search/1080p/t-19/">19</a> <a href="/search/1080p/t-20/">20</a> <a href="/search/1080p/t-21/">21</a> <a href="/search/1080p/t-22/">22</a> <a href="/search/1080p/t-23/">23</a> <a href="/search/1080p/t-2/">Next</a> <a href="/search/1080p/t-255/">Last</a> </div>
The results I'm getting (partial):
about:/search/1080p/t-20/
about:/search/1080p/t-21/
about:/search/1080p/t-22/
about:/search/1080p/t-23/
about:/search/1080p/t-255/
Answer 1:
The idea is to scrape the pages in a loop and test some condition on each one; as soon as the condition fails, exit the loop.
That condition could be, for example, checking a key against a dictionary, checking whether an element exists, or any other logic specific to your problem.
In your case, the site keeps displaying page 255 for any page number beyond it, and that is the clue we need: we can compare an element from page n with the same element from page n-1.
For instance, if an element on page 256 is the same as the one on page 255, then exit the loop. Please see the sample code below:
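' Requires references to "Microsoft XML, v6.0" and "Microsoft HTML Object Library".
Sub yify()
    Const mlink = "https://www.yify-torrent.org/search/1080p/t-"
    Dim http As New XMLHTTP60, html As New HTMLDocument
    Dim post As Object, posts As Object
    Dim pageno As Long, rowno As Long

    pageno = 1
    rowno = 1

    Do
        ' Fetch page number pageno synchronously and load it into the HTML document.
        With http
            .Open "GET", mlink & pageno & "/", False
            .send
            html.body.innerHTML = .responseText
        End With

        Set posts = html.getElementsByClassName("mv")

        ' Exit once the last value written to column A equals the link text of the
        ' 18th "mv" element on the freshly loaded page (the site is repeating its last page).
        If Cells(rowno, 1) = posts(17).getElementsByTagName("a")(0).innerText Then Exit Do

        ' Write the text of the first <div> of every entry to the next free row.
        For Each post In posts
            With post.getElementsByTagName("div")
                If .Length Then
                    rowno = rowno + 1
                    Cells(rowno, 1) = .Item(0).innerText
                End If
            End With
        Next post

        Debug.Print "pageno: " & pageno & " completed."
        pageno = pageno + 1
    Loop
End Sub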
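The dictionary check mentioned at the start of this answer is not shown in the sample above, so here is a minimal sketch of it. It assumes Windows, where the late-bound Scripting.Dictionary object is available; the names UniqueLinks and DemoUniqueLinks are made up for illustration and the hrefs are sample values taken from the pager in the question. It only demonstrates the "have I seen this key already?" test, not the scraping itself.

```vba
Option Explicit

' Hypothetical helper: returns the items of links with duplicates removed,
' keeping the order in which each item was first seen.
Function UniqueLinks(ByVal links As Variant) As Variant
    Dim seen As Object, i As Long
    ' Late binding, so no extra reference is needed (assumes Windows).
    Set seen = CreateObject("Scripting.Dictionary")
    For i = LBound(links) To UBound(links)
        ' Only add an href the first time it is encountered.
        If Not seen.Exists(links(i)) Then seen.Add links(i), True
    Next i
    UniqueLinks = seen.Keys   ' zero-based array of the unique hrefs
End Function

' Hypothetical demo with sample hrefs; the second "t-2" entry is a duplicate.
Sub DemoUniqueLinks()
    Dim result As Variant
    result = UniqueLinks(Array("/search/1080p/t-2/", _
                               "/search/1080p/t-3/", _
                               "/search/1080p/t-2/"))
    Debug.Print UBound(result) + 1   ' number of unique links: 2
End Sub
```

In a scraping loop the same Exists test could serve as the exit condition: if every href on a freshly loaded page is already in the dictionary, the site has probably started repeating itself and the loop can stop.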
Source: https://stackoverflow.com/questions/45362363/how-to-get-all-the-links-leading-to-the-next-page