Question
I want to work with this page in Python: http://www.sothebys.com/en/search-results.html?keyword=degas%27
This is my code:
from bs4 import BeautifulSoup
import requests
page = requests.get('http://www.sothebys.com/en/search-results.html?keyword=degas%27')
soup = BeautifulSoup(page.content, "lxml")
print(soup)
I'm getting the following output:
<html><head>
<title>Invalid URL</title>
</head><body>
<h1>Invalid URL</h1>
The requested URL "[no URL]", is invalid.<p>
Reference #9.8f4f1502.1494363829.5fae0e0e
</p></body></html>
I can open the page with my browser from the same machine and don't get any error message. When I use the same code with another URL the correct HTML content is fetched:
from bs4 import BeautifulSoup
import requests
page = requests.get('http://www.christies.com/lotfinder/searchresults.aspx?&searchtype=p&action=search&searchFrom=header&lid=1&entry=degas')
soup = BeautifulSoup(page.content, "lxml")
print(soup)
I also tested other URLs (reddit, google, ecommerce sites) and didn't encounter any issues. So the same code works with one URL but not with another. What is the problem?
Answer 1:
This website rejects requests that do not appear to come from a browser, which is why you get the Invalid URL error. Sending a browser-like User-Agent header with the request works fine:
import requests
from bs4 import BeautifulSoup
ua = {"User-Agent":"Mozilla/5.0"}
url = "http://www.sothebys.com/en/search-results.html?keyword=degas%27"
page = requests.get(url, headers=ua)
soup = BeautifulSoup(page.text, "lxml")
print(soup)
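To see what the custom header changes, here is a minimal sketch that inspects the User-Agent requests would send, without contacting the server. It assumes the site filters on the User-Agent header, as this answer suggests; the helper name prepared_user_agent is hypothetical and only used for illustration.

```python
import requests

URL = "http://www.sothebys.com/en/search-results.html?keyword=degas%27"


def prepared_user_agent(extra_headers=None):
    """Return the User-Agent requests would send; no network traffic is generated."""
    session = requests.Session()
    request = requests.Request("GET", URL, headers=extra_headers)
    # prepare_request merges the session's default headers with the per-request ones
    prepared = session.prepare_request(request)
    return prepared.headers["User-Agent"]


default_ua = prepared_user_agent()
custom_ua = prepared_user_agent({"User-Agent": "Mozilla/5.0"})

print(default_ua)  # a "python-requests/<version>" string, easy for a server to filter
print(custom_ua)   # the value passed in the headers dict
```

By default requests identifies itself as python-requests, which a server can easily tell apart from a browser; the headers argument overrides that value for the single request.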
Answer 2:
Change your code to:
soup = BeautifulSoup(page.text, "lxml")
page.content is a byte string, so you would have to decode it to a string yourself. page.text is already decoded, so you should use that instead.
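The difference between the two attributes can be sketched in plain Python. The values below are made-up stand-ins for page.content and page.text, and UTF-8 is an assumed encoding; a real response may use a different one.

```python
# Hypothetical stand-in for page.content: the raw bytes of a response body
raw = "<title>Degas, né à Paris</title>".encode("utf-8")

# Hypothetical stand-in for page.text: the same body decoded to a string
text = raw.decode("utf-8")

print(type(raw).__name__)   # bytes
print(type(text).__name__)  # str
print(text)
```

Each accented character takes two bytes in UTF-8, so the byte string is longer than the decoded string even though both represent the same markup.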
Source: https://stackoverflow.com/questions/43880195/web-scraping-with-python-3-6-and-beautifulsoup-getting-invalid-url