Scrape the absolute URL instead of a relative path in Python

没有蜡笔的小新 2020-12-10 11:20

I'm trying to get all the hrefs from an HTML page and store them in a list for later processing, such as this:

Example URL: www.example-page-xl.com

3 answers
  • 2020-12-10 11:36

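    I find the solution mentioned here to be the most robust. It strips the params, query, and fragment from a URL and, optionally, keeps the directory part of the path.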

    import urllib.parse
    
    def base_url(url, with_path=False):
        parsed = urllib.parse.urlparse(url)
        path   = '/'.join(parsed.path.split('/')[:-1]) if with_path else ''
        parsed = parsed._replace(path=path)
        parsed = parsed._replace(params='')
        parsed = parsed._replace(query='')
        parsed = parsed._replace(fragment='')
        return parsed.geturl()
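A quick usage sketch of the helper above, in case the two modes are not obvious. The function is repeated in compact form so the snippet runs on its own, and the page URL (path, query, and fragment) is made up for illustration.

```python
import urllib.parse

def base_url(url, with_path=False):
    # Same logic as the helper above, condensed into a single _replace call.
    parsed = urllib.parse.urlparse(url)
    path = '/'.join(parsed.path.split('/')[:-1]) if with_path else ''
    return parsed._replace(path=path, params='', query='', fragment='').geturl()

# Hypothetical page URL, not taken from the question.
page = 'https://www.example-page-xl.com/helloworld/index.php?id=1#top'

site_root = base_url(page)                 # scheme and host only
page_dir = base_url(page, with_path=True)  # scheme, host, and directory

print(site_root)  # https://www.example-page-xl.com
print(page_dir)   # https://www.example-page-xl.com/helloworld
```

Note that the with_path=True result has no trailing slash. If it is later passed to urljoin as the base, a relative link such as 'page.php' resolves against the site root rather than /helloworld/, so append '/' first or pass the full page URL to urljoin instead.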
    
  • 2020-12-10 11:42

    In this case urllib.parse.urljoin helps you. You should modify your code like this (Python 3):

    import bs4 as bs
    import urllib.request
    from urllib.parse import urljoin
    
    web_url = 'https://www.example-page-xl.com'
    sauce = urllib.request.urlopen(web_url).read()
    soup = bs.BeautifulSoup(sauce, 'lxml')
    
    # soup.section is the first <section> element on the page
    section = soup.section
    
    # Resolve each href against the page URL; absolute hrefs pass through unchanged
    for url in section.find_all('a'):
        print(urljoin(web_url, url.get('href')))
    

    Here urljoin handles both absolute and relative paths.
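Since the question asks for the links in a list rather than printed, here is a minimal sketch of that variant. To keep it runnable without a network connection or third-party packages, it swaps BeautifulSoup for the standard library's html.parser and feeds it a made-up HTML string. The LinkCollector class and the sample markup are assumptions for illustration, not part of the original code.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collect every <a href> as an absolute URL (hypothetical helper)."""

    def __init__(self, base):
        super().__init__()
        self.base = base
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        href = dict(attrs).get('href')
        if href:  # skip anchors that have no href attribute
            self.links.append(urljoin(self.base, href))

# Made-up markup standing in for the downloaded page.
sample_html = '''
<section>
  <a href="/helloworld/index.php">relative</a>
  <a href="https://www.example-page-xl.com/about">absolute</a>
  <a name="no-href">anchor only</a>
</section>
'''

collector = LinkCollector('https://www.example-page-xl.com')
collector.feed(sample_html)
links = collector.links
print(links)
```

The same idea carries over to the BeautifulSoup loop: build the list with a comprehension over section.find_all('a') and call urljoin on each href.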

  • 2020-12-10 11:50

    urllib.parse.urljoin() might help. It joins a base URL and a link, resolving relative paths against the base and returning absolute URLs unchanged. Note that this is Python 3 code.

    >>> import urllib.parse
    >>> base = 'https://www.example-page-xl.com'
    
    >>> urllib.parse.urljoin(base, '/helloworld/index.php') 
    'https://www.example-page-xl.com/helloworld/index.php'
    
    >>> urllib.parse.urljoin(base, 'https://www.example-page-xl.com/helloworld/index.php')
    'https://www.example-page-xl.com/helloworld/index.php'
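A few more link shapes that tend to show up when scraping, as a sketch of how urljoin resolves them. The base here is a hypothetical page URL with a path (not just the host), and cdn.example.com is a made-up host.

```python
from urllib.parse import urljoin

# Hypothetical page the links were scraped from.
page = 'https://www.example-page-xl.com/helloworld/index.php'

resolved = {
    href: urljoin(page, href)
    for href in ('page2.php', '../img/logo.png', '//cdn.example.com/app.js', '#top')
}

for href, absolute in resolved.items():
    print(f'{href!r:30} -> {absolute}')

# 'page2.php'                -> https://www.example-page-xl.com/helloworld/page2.php
# '../img/logo.png'          -> https://www.example-page-xl.com/img/logo.png
# '//cdn.example.com/app.js' -> https://cdn.example.com/app.js
# '#top'                     -> https://www.example-page-xl.com/helloworld/index.php#top
```

Passing the URL of the page itself as the base, rather than only the site root, is what makes the first two cases resolve relative to /helloworld/.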
    