Scrape the absolute URL instead of a relative path in Python

没有蜡笔的小新 2020-12-10 11:20

I'm trying to get all the hrefs from an HTML page and store them in a list for future processing, such as this:

Example URL: www.example-page-xl.com

         


        
3 answers
  •  北荒 (original poster)
    2020-12-10 11:42

    In this case urljoin helps you. In Python 3 it lives in urllib.parse (in Python 2 it was urlparse.urljoin). You should modify your code like this:

    import bs4 as bs
    import urllib.request
    from urllib.parse import urljoin  # Python 3 location of urljoin
    
    web_url = 'https://www.example-page-xl.com'
    sauce = urllib.request.urlopen(web_url).read()
    soup = bs.BeautifulSoup(sauce, 'lxml')
    
    # First <section> element of the page
    section = soup.section
    
    # Resolve every link against the page URL and print the result
    for url in section.find_all('a'):
        print(urljoin(web_url, url.get('href')))
    

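    Here urljoin handles both absolute and relative paths: a relative href is resolved against web_url, while an href that is already absolute is returned unchanged.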
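
To see how urljoin treats different kinds of href without fetching a page, here is a minimal sketch using only the standard library. The base URL is the one from the question; the sample hrefs are made up for illustration and do not come from the real page. It also collects the results in a list, which is what the question asks for.

```python
from urllib.parse import urljoin

# Base URL from the question; the hrefs below are hypothetical samples.
base_url = "https://www.example-page-xl.com"

sample_hrefs = [
    "/about",                       # root-relative path
    "contact.html",                 # path relative to the current page
    "https://other.example.org/x",  # already absolute
    "//cdn.example.org/app.js",     # scheme-relative
]

# Resolve each href against the base URL and keep the results in a list.
absolute_urls = [urljoin(base_url, href) for href in sample_hrefs]

for url in absolute_urls:
    print(url)
```

With real scraped data, the same list comprehension could be applied to the href values taken from the anchors that find_all('a') returns.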
