Using Beautiful Soup to get the full URL in source code

误落风尘 2020-12-10 14:51

So I was looking at some source code and I came across this bit of code



        
2 Answers
  • 2020-12-10 15:24
    <a href="/folder/big/a.jpg">
    

    That’s a root-relative URL: an absolute path on the current host. So if the HTML file is at http://example.com/foo/bar.html, then resolving the href /folder/big/a.jpg against it will result in this:

    http://example.com/folder/big/a.jpg
    

    That is, keep the scheme and host name, and replace the path with the new one.

    Python's standard library provides the urljoin function to perform this operation for you:

    >>> from urllib.parse import urljoin
    >>> base = 'http://example.com/foo/bar.html'
    >>> href = '/folder/big/a.jpg'
    >>> urljoin(base, href)
    'http://example.com/folder/big/a.jpg'
    

    For Python 2, the function is within the urlparse module.
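
To see how urljoin treats other kinds of href, here is a small sketch that resolves a few forms against the same base. The hrefs other than /folder/big/a.jpg are made up for illustration, as are the cdn.example.com and other.example.org hosts.

```python
from urllib.parse import urljoin

base = "http://example.com/foo/bar.html"

# Hypothetical hrefs covering the common cases found in an <a> tag.
hrefs = [
    "/folder/big/a.jpg",                # root-relative: replaces the whole path
    "a.jpg",                            # relative: replaces the last path segment
    "../a.jpg",                         # relative: goes up one directory first
    "//cdn.example.com/a.jpg",          # scheme-relative: keeps only the scheme
    "https://other.example.org/a.jpg",  # already absolute: returned unchanged
]

# Map each href to the absolute URL it resolves to.
resolved = {href: urljoin(base, href) for href in hrefs}

for href, full in resolved.items():
    print(f"{href} -> {full}")
```

Because an href that is already absolute comes back unchanged, there is no need to check for "http://" or "https://" before calling urljoin.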

  • 2020-12-10 15:25
    from urllib.parse import urljoin

    import requests
    from bs4 import BeautifulSoup

    r = requests.get("http://example.com")

    base_url = r.url  # final URL of the page (after any redirects)
    data = r.content  # raw content of the page
    soup = BeautifulSoup(data, 'lxml')  # the 'lxml' parser requires the lxml package
    href = soup.find('a')['href']  # you need to modify this selector

    # urljoin returns href unchanged if it is already absolute,
    # and otherwise resolves it against base_url
    url = urljoin(base_url, href)

    print(url)  # this is your full url
    
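The same idea extends to every link on a page. Below is a minimal sketch that works on an HTML string, so it runs without a network request. The absolute_links helper and the sample markup are hypothetical, and the built-in "html.parser" is used so that lxml is not required.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def absolute_links(html, page_url):
    """Return every <a href> found in html as an absolute URL."""
    soup = BeautifulSoup(html, "html.parser")
    # href=True skips anchors that have no href attribute at all.
    return [urljoin(page_url, a["href"]) for a in soup.find_all("a", href=True)]


# Hypothetical markup standing in for a downloaded page.
sample_html = """
<a href="/folder/big/a.jpg">root-relative</a>
<a href="thumbs/b.jpg">relative</a>
<a href="https://example.org/c.jpg">absolute</a>
<a name="no-href">skipped</a>
"""

links = absolute_links(sample_html, "http://example.com/foo/bar.html")
print(links)
```

With a real page, pass r.content as html and r.url as page_url.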