How to get full web address with BeautifulSoup

Question

I cannot figure out how to get the full address of a link on a web page: for example, I get "/wiki/Main_Page" instead of "https://en.wikipedia.org/wiki/Main_Page". I cannot simply prepend the page URL to the link, since that would give "https://en.wikipedia.org/wiki/WKIK/wiki/Main_Page", which is incorrect. I want this to work for any website, so I am looking for a general solution.

Here is the code:

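from bs4 import BeautifulSoup
import requests

url = "https://en.wikipedia.org/wiki/WKIK"
r = requests.get(url)
data = r.text
# Name the parser explicitly so results do not depend on which parsers are installed.
soup = BeautifulSoup(data, "html.parser")

# Print the raw href of every anchor that has one.
for link in soup.find_all('a', href=True):
    print("Found the URL:", link['href'])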

Here is part of what it returns:

>Found the URL: /wiki/WKIK_(AM)
>Found the URL: /wiki/WKIK-FM
>Found the URL: /wiki/File:Disambig_gray.svg
>Found the URL: /wiki/Help:Disambiguation
>Found the URL: //en.wikipedia.org/w/index.php?title=Special:WhatLinksHere/WKIK&namespace=0

Answer 1:

When you read the href attribute of a link element, you will very often get a relative link such as /wiki/Main_Page.

For this site the base URL is always 'https://en.wikipedia.org', so all you need to do is prepend it:

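from bs4 import BeautifulSoup
import requests

base_url = 'https://en.wikipedia.org'
search_url = "https://en.wikipedia.org/wiki/WKIK"
r = requests.get(search_url)
data = r.content
soup = BeautifulSoup(data, "html.parser")

for link in soup.find_all('a', href=True):
    print("Found the URL:", link['href'])
    # Skip empty and bare '#' links; prepend the site root to the rest.
    # Note: this assumes every href is a root-relative path such as /wiki/Main_Page.
    if link['href'] != '#' and link['href'].strip() != '':
        final_url = base_url + link['href']
        print("Full URL:", final_url)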
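
The hard-coded base_url above only fits one site. As a rough sketch of how the same idea could be made less site-specific, the scheme and host can be derived from the page URL with the standard library's urllib.parse.urlsplit. The helper name site_root is hypothetical and not part of the original answer, and the approach still only handles root-relative hrefs.

```python
from urllib.parse import urlsplit


def site_root(page_url):
    """Return scheme + host of page_url (hypothetical helper for illustration)."""
    parts = urlsplit(page_url)
    return f"{parts.scheme}://{parts.netloc}"


base_url = site_root("https://en.wikipedia.org/wiki/WKIK")
# Prepending works for root-relative paths only (hrefs that start with a single '/').
final_url = base_url + "/wiki/Main_Page"
print(final_url)
```

Hrefs that are already absolute, protocol-relative (//host/path) or relative to the current directory (../page) would still need separate handling with this approach.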

Answer 2:

Maybe something like this will suit you:

for link in soup.find_all('a', href=True):
    href = link['href']
    if 'en.wikipedia.org' not in href:
        # No host in the link (e.g. /wiki/Main_Page): prepend the site root.
        print("Found the URL:", 'https://en.wikipedia.org' + href)
    elif 'http' not in href:
        # Protocol-relative link (e.g. //en.wikipedia.org/w/index.php): add only the scheme.
        print("Found the URL:", 'https:' + href)
    else:
        print("Found the URL:", href)

Answer 3:

The other answers here may run into issues with certain relative URLs, such as ones that include periods (../page).

The requests library exposes a function called urljoin (re-exported from the standard library's urllib.parse) that builds the full URL:

requests.compat.urljoin(currentPage, link)

So if you're on https://en.wikipedia.org/wiki/WKIK and there's a link on the page with an href of /wiki/Main_Page, that function would return https://en.wikipedia.org/wiki/Main_Page.
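
To make this concrete, here is a minimal sketch that resolves several kinds of href against the page URL from the question. It imports urljoin directly from urllib.parse, which is the same function that requests.compat exposes on Python 3. The helper name resolve_links and the sample hrefs "../page" and "https://www.mediawiki.org/" are assumptions added for illustration; the other hrefs come from the question's output.

```python
from urllib.parse import urljoin


def resolve_links(page_url, hrefs):
    """Resolve each href against page_url (hypothetical helper for illustration)."""
    return [urljoin(page_url, href) for href in hrefs]


page_url = "https://en.wikipedia.org/wiki/WKIK"
hrefs = [
    "/wiki/Main_Page",                  # root-relative path
    "//en.wikipedia.org/w/index.php",   # protocol-relative link
    "../page",                          # path relative to the current directory
    "https://www.mediawiki.org/",       # already absolute, returned unchanged
]

for full_url in resolve_links(page_url, hrefs):
    print("Found the URL:", full_url)
```

In the scraping loop, the same call would be urljoin(url, link['href']) for each link that BeautifulSoup finds, so no site-specific base URL has to be hard-coded.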

Source: https://stackoverflow.com/questions/44746021/how-to-get-full-web-address-with-beautifulsoup

Tags

python

beautifulsoup

web-crawler