How to get all the links leading to the next page?

Posted by 戏子无情 on 2019-12-24 08:00:17

Question


I've written some VBA code to collect all the links that lead to the next pages of a website. The highest page number among those links is 255. When I run my script, it writes 6906 links, which means the loop picks up the same pager links again and again and I end up writing duplicates. After filtering out the duplicates, only 254 unique links remain. My goal is to iterate over the pages without hardcoding the highest page number. Here is what I've tried:

Sub YifyLink()
    Const link = "https://www.yify-torrent.org/search/1080p/"
    Dim http As New XMLHTTP60, html As New HTMLDocument, htm As New HTMLDocument
    Dim x As Long, y As Long, item_link as String

    With http
        .Open "GET", link, False
        .send
        html.body.innerHTML = .responseText
    End With

    For Each post In html.getElementsByClassName("pager")(0).getElementsByTagName("a")
        If InStr(post.innerText, "Last") Then
            x = Split(Split(post.href, "-")(1), "/")(0)
        End If
    Next post
    For y = 0 To x
        item_link = link & "t-" & y & "/"

        With http
            .Open "GET", item_link, False
            .send
            htm.body.innerHTML = .responseText
        End With
        For Each posts In htm.getElementsByClassName("pager")(0).getElementsByTagName("a")
            I = I + 1: Cells(I, 1) = posts.href
        Next posts
    Next y
End Sub

The HTML elements containing the links:

<div class="pager"><a href="/search/1080p/" class="current">1</a> <a href="/search/1080p/t-2/">2</a> <a href="/search/1080p/t-3/">3</a> <a href="/search/1080p/t-4/">4</a> <a href="/search/1080p/t-5/">5</a> <a href="/search/1080p/t-6/">6</a> <a href="/search/1080p/t-7/">7</a> <a href="/search/1080p/t-8/">8</a> <a href="/search/1080p/t-9/">9</a> <a href="/search/1080p/t-10/">10</a> <a href="/search/1080p/t-11/">11</a> <a href="/search/1080p/t-12/">12</a> <a href="/search/1080p/t-13/">13</a> <a href="/search/1080p/t-14/">14</a> <a href="/search/1080p/t-15/">15</a> <a href="/search/1080p/t-16/">16</a> <a href="/search/1080p/t-17/">17</a> <a href="/search/1080p/t-18/">18</a> <a href="/search/1080p/t-19/">19</a> <a href="/search/1080p/t-20/">20</a> <a href="/search/1080p/t-21/">21</a> <a href="/search/1080p/t-22/">22</a> <a href="/search/1080p/t-23/">23</a> <a href="/search/1080p/t-2/">Next</a> <a href="/search/1080p/t-255/">Last</a> </div>

The results I'm getting (partial):

about:/search/1080p/t-20/
about:/search/1080p/t-21/
about:/search/1080p/t-22/
about:/search/1080p/t-23/
about:/search/1080p/t-255/

Answer 1:


The idea is to scrape the pages in a loop and test some condition on each one; as soon as the condition fails, exit the loop.

That condition could be, for example, checking a key against a dictionary, checking whether an element exists, or any other logic specific to your problem.
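The dictionary check mentioned above could look roughly like the sketch below. The helper name `IsNewKey` and the sample keys are hypothetical and only stand in for whatever value identifies a page in your case; the sketch assumes Windows, where `Scripting.Dictionary` is available.

```vba
' Returns True the first time a key is seen (and records it), False on a repeat.
Function IsNewKey(ByVal seen As Object, ByVal key As String) As Boolean
    If seen.Exists(key) Then
        IsNewKey = False
    Else
        seen.Add key, True
        IsNewKey = True
    End If
End Function

Sub DemoStopOnRepeat()
    Dim seen As Object, keys As Variant, i As Long
    Set seen = CreateObject("Scripting.Dictionary")

    ' Hypothetical per-page keys; the repeated last entry stands in for
    ' a site that serves its final page again for higher page numbers.
    keys = Array("page-1-title", "page-2-title", "page-3-title", "page-3-title")

    For i = LBound(keys) To UBound(keys)
        ' A key that was already recorded means we have gone past the last page.
        If Not IsNewKey(seen, CStr(keys(i))) Then Exit For
        Debug.Print "scraped: " & keys(i)
    Next i
    Debug.Print "unique pages: " & seen.Count
End Sub
```

In a real scraper the key would be read from each downloaded page instead of from a fixed array.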

In your case, the site keeps serving page 255 for any higher page number, and that gives us a clue: we can compare an element from page n with the corresponding element from page n-1.

For instance, if an element on page 256 is the same as the one on page 255, exit the loop (or the Sub). Please see the sample code below:

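Sub yify()
    ' Requires references to "Microsoft XML, v6.0" and "Microsoft HTML Object Library".
    Const mlink = "https://www.yify-torrent.org/search/1080p/t-"
    Dim http As New XMLHTTP60, html As New HTMLDocument
    Dim post As Object, posts As Object
    Dim pageno As Long, rowno As Long

    pageno = 1
    rowno = 1

    Do
        ' Download page number pageno and load it into the HTML document.
        With http
            .Open "GET", mlink & pageno & "/", False
            .send
            html.body.innerHTML = .responseText
        End With

        Set posts = html.getElementsByClassName("mv")

        ' Exit when the value last written to column A equals the link text of
        ' the 18th "mv" element (index 17) on this page: the site is repeating itself.
        If Cells(rowno, 1) = posts(17).getElementsByTagName("a")(0).innerText Then Exit Do

        ' Write the text of the first <div> of every "mv" element to column A.
        For Each post In posts
            With post.getElementsByTagName("div")
                If .Length Then
                    rowno = rowno + 1
                    Cells(rowno, 1) = .Item(0).innerText
                End If
            End With
        Next post

        Debug.Print "pageno: " & pageno & " completed."
        pageno = pageno + 1
    Loop
End Sub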
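If all you need is the list of page links rather than the page contents, another option is to read the page count from the "Last" link in the pager markup shown in the question and build the URLs from it. The following is only a sketch under that assumption: `PageNumberFromHref` and `ListPageLinks` are hypothetical names, and the sample href is copied from the question's HTML instead of being downloaded.

```vba
' Extracts N from an href ending in "t-N/"; returns 1 when there is no "t-" segment,
' as is the case for the first page ("/search/1080p/").
Function PageNumberFromHref(ByVal href As String) As Long
    Dim pos As Long
    pos = InStrRev(href, "t-")
    If pos = 0 Then
        PageNumberFromHref = 1
    Else
        ' Val reads the leading digits and stops at the trailing "/".
        PageNumberFromHref = CLng(Val(Mid$(href, pos + 2)))
    End If
End Function

Sub ListPageLinks()
    Const baseLink As String = "https://www.yify-torrent.org/search/1080p/"
    Dim lastPage As Long, y As Long

    ' Sample value taken from the question's markup; in practice this would be
    ' the href of the pager anchor whose text is "Last".
    lastPage = PageNumberFromHref("/search/1080p/t-255/")

    ' Page 1 has no "t-N" segment; pages 2..lastPage follow the "t-N/" pattern.
    Cells(1, 1) = baseLink
    For y = 2 To lastPage
        Cells(y, 1) = baseLink & "t-" & y & "/"
    Next y
End Sub
```

Each link is written exactly once, so no duplicates need filtering afterwards. Building the URLs from the constant also avoids the `about:` prefix seen in the question's output, which presumably appears because the relative hrefs are resolved inside a document that has no base URL.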


Source: https://stackoverflow.com/questions/45362363/how-to-get-all-the-links-leading-to-the-next-page