So I was looking at some source code and I came across this bit of code
<a href="/folder/big/a.jpg">
That’s an absolute address for the current host. So if the HTML file is at http://example.com/foo/bar.html
, then applying the url /folder/big/a.jpg
will result in this:
http://example.com/folder/big/a.jpg
I.e. take the host name and apply the new path to it.
Python has the builtin urljoin function to perform this operation for you:
>>> from urllib.parse import urljoin
>>> base = 'http://example.com/foo/bar.html'
>>> href = '/folder/big/a.jpg'
>>> urljoin(base, href)
'http://example.com/folder/big/a.jpg'
For Python 2, the function is within the urlparse module.
from bs4 import BeautifulSoup
import requests
import lxml
r = requests.get("http://example.com")
url = r.url # this is base url
data = r.content # this is content of page
soup = BeautifulSoup(data, 'lxml')
temp_url = soup.find('a')['href'] # you need to modify this selector
if temp_url[0:7] == "http://" or temp_url[0:8] == "https://" : # if url have http://
url = temp_url
else:
url = url + temp_url
print url # this is your full url