Python requests isn't giving me the same HTML as my browser is

Question

I am grabbing a Wikia page using Python requests. There's a problem, though: the requests request isn't giving me the same HTML as my browser is with the very same page.

For comparison, here's the page Firefox gets me, and here's the page requests fetches (download them to view - sorry, no easy way to just visually host a bit of HTML from another site).

You'll note a few differences (super unfriendly diff). There are some small things, like attributes being ordered differently and such, but there are also a few very, very large things. Most important is the lack of the last six <img>s, and the entirety of the navigation and footer sections. Even in the raw HTML it looks like the page cut off abruptly.

Why is this happening, and is there a way to fix it? I've thought of a bunch of things already, none of which have been fruitful:

Request headers interfering? Nope, I tried copying the headers my browser sends, User-Agent and all, 1:1 into the requests request, but nothing changed.
JavaScript loading content after the HTML is loaded? Nah. Even with JS disabled, Firefox gives me the "good" page.
Uh... well... what else could there be?

It'd be amazing if you know a way this could happen and a way to fix it. Thank you!

Answer 1:

I had a similar issue:

Identical headers with Python and through the browser
JavaScript definitely ruled out as a cause

To resolve the issue, I ended up swapping out the requests library for urllib.request.

Basically, I replaced:

import requests

session = requests.Session()
r = session.get(URL)

with:

import urllib.request

r = urllib.request.urlopen(URL)

and then it worked.

Maybe one of those libraries is doing something strange behind the scenes? Not sure if that's an option for you or not.
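
Before swapping libraries, it may help to measure what each one actually returned. The following is only a rough sketch: summarize_html is a hypothetical helper name, and the three indicators (byte count, number of <img> tags, whether the document ends with </html>) are assumptions based on the symptoms described in the question.

```python
import re


def summarize_html(body: bytes) -> dict:
    """Return a few cheap indicators for comparing two copies of a page."""
    text = body.decode("utf-8", errors="replace")
    return {
        "bytes": len(body),
        # Count opening <img> tags, case-insensitively.
        "img_tags": len(re.findall(r"<img\b", text, flags=re.IGNORECASE)),
        # A page that was cut off will usually lack the closing </html>.
        "closes_html": text.rstrip().lower().endswith("</html>"),
    }


# Stand-in documents; in practice these would be the two downloaded bodies.
full = b"<html><body><img src='a.png'><img src='b.png'></body></html>\n"
cut = b"<html><body><img src='a.png'>"

print(summarize_html(full))
print(summarize_html(cut))
```

To compare the two libraries, the same function could be applied to requests.get(URL).content and to urllib.request.urlopen(URL).read().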

Answer 2:

I suspect that you're not sending the proper headers (or are sending them incorrectly) with your request. That would explain why you are getting different content. Here is an example of an HTTP request with a header:

# Python 2 only: urllib2 became urllib.request in Python 3
import urllib2

url = 'https://www.google.co.il/search?q=eminem+twitter'
user_agent = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.101 Safari/537.36'

# header variable
headers = { 'User-Agent' : user_agent }

# creating request
req = urllib2.Request(url, None, headers)

# getting html
html = urllib2.urlopen(req).read()
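
The snippet above uses Python 2's urllib2. On Python 3 the same idea might look roughly like the sketch below. build_request and fetch_html are hypothetical helper names, and the URL and User-Agent string are placeholders rather than values from the question.

```python
import urllib.request


def build_request(url: str, user_agent: str) -> urllib.request.Request:
    """Create a GET request that carries a browser-like User-Agent header."""
    headers = {"User-Agent": user_agent}
    return urllib.request.Request(url, data=None, headers=headers)


def fetch_html(url: str, user_agent: str) -> bytes:
    """Send the request and return the undecoded response body."""
    with urllib.request.urlopen(build_request(url, user_agent)) as response:
        return response.read()


# Building the request does not touch the network, so it is safe to inspect.
req = build_request("https://example.com/", "Mozilla/5.0 (X11; Linux x86_64)")
# urllib stores header names capitalized, hence "User-agent".
print(req.get_header("User-agent"))
```

With requests, the equivalent is to pass the same dictionary as requests.get(url, headers=headers).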

If you are sure that you are sending the right headers but are still getting different HTML, you can try Selenium. It lets you drive a browser directly (or PhantomJS if your machine doesn't have a GUI), so you can grab the HTML straight from the browser.

Answer 3:

A lot of the differences I see are showing me that the content is still there, it's just rendered in a different order, sometimes with different spacing.

You could be receiving different content based on multiple different things:

Your headers
Your user agent
The time!
The order in which the web application decides to render elements on the page, subject to random attribute order, as the elements may be pulled from an unsorted data source.

If you could include all of your headers at the top of that Diff, then we may be able to make more sense of it.

I suspect that the application chose not to render certain images because they aren't optimized for what it thinks is some kind of robot or mobile device (Python Requests).

On a closer look at the diff, it appears that everything was loaded in both requests, just with a different formatting.
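
To separate formatting noise from genuinely missing content, the two documents can be compared after normalizing attribute order and whitespace. This is a minimal sketch using only the standard library's html.parser; normalized_events is a hypothetical helper, and the sample strings are made up for illustration.

```python
from html.parser import HTMLParser


class _EventCollector(HTMLParser):
    """Record tags and text while ignoring attribute order and spacing."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.events = []

    def handle_starttag(self, tag, attrs):
        # Sort attributes so that id/class order does not matter.
        ordered = tuple(sorted(attrs, key=lambda kv: (kv[0], kv[1] or "")))
        self.events.append(("start", tag, ordered))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Collapse runs of whitespace and drop whitespace-only text nodes.
        text = " ".join(data.split())
        if text:
            self.events.append(("text", text))


def normalized_events(html: str) -> list:
    """Return the document as a list of order-insensitive parse events."""
    parser = _EventCollector()
    parser.feed(html)
    parser.close()
    return parser.events


a = '<div class="x" id="y">  Hello   world </div>'
b = '<div id="y" class="x">Hello world</div>'
c = '<div id="y" class="x">Hello world</div><img src="z.png">'

print(normalized_events(a) == normalized_events(b))
print(normalized_events(a) == normalized_events(c))
```

If the normalized event lists of the browser copy and the requests copy still differ, the difference is in the content rather than in the formatting.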

Answer 4:

I was facing a similar issue while requesting a page. Then I noticed that the URL I was using needed 'http' as its scheme, but I was using 'https'. My request URL looked like https://example.com, so I changed it to http://example.com. Hope this solves the problem.
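
A quick way to try the other scheme without editing the URL by hand is sketched below. with_scheme is a hypothetical helper and the URL is a placeholder.

```python
from urllib.parse import urlsplit, urlunsplit


def with_scheme(url: str, scheme: str) -> str:
    """Return the same URL with only its scheme replaced."""
    parts = urlsplit(url)
    return urlunsplit(parts._replace(scheme=scheme))


https_url = "https://example.com/wiki/Page?x=1"
http_url = with_scheme(https_url, "http")
print(http_url)
```

Fetching both variants and comparing the bodies shows whether the scheme matters for a given site. With requests, response.url and response.history also reveal whether the server redirected from one scheme to the other.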

Answer 5:

Maybe Requests and browsers process the raw data from the web server differently, and the diff in the example above only compares the processed HTML.

I found that when HTML is broken, different browsers, e.g. Chrome and Safari, repair it in different ways when parsing. So maybe the same applies to Requests and Firefox.

I suggest diffing the raw data from both Requests and Firefox, i.e. the byte stream on the socket. With Requests, the .raw attribute of the response object gives access to that raw data when the request is made with stream=True. (http://docs.python-requests.org/en/master/user/quickstart/) If the raw data from both sides is the same and the HTML contains some broken markup, the difference may come from the different auto-fixing policies that Requests and the browser apply when handling broken HTML.
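
Once both byte streams are saved, locating the first point where they diverge is straightforward. The sketch below assumes the two bodies are already available as bytes objects; first_difference and context are hypothetical helper names, and the sample data is invented.

```python
def first_difference(a: bytes, b: bytes):
    """Return the index of the first differing byte, or None if identical."""
    for index, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return index
    if len(a) != len(b):
        # One stream is a strict prefix of the other (e.g. a truncated page).
        return min(len(a), len(b))
    return None


def context(data: bytes, offset: int, width: int = 20) -> bytes:
    """Return the bytes surrounding an offset, for eyeballing the mismatch."""
    start = max(0, offset - width)
    return data[start:offset + width]


browser_bytes = b"<html><body><p>one</p><p>two</p></body></html>"
script_bytes = b"<html><body><p>one</p>"

offset = first_difference(browser_bytes, script_bytes)
print(offset, context(browser_bytes, offset))
```

Note that requests.get(url, stream=True).raw.read() returns the body before content decoding, so it may still be gzip-compressed; response.content is the decoded body and is usually the fairer thing to compare against what the browser saved.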

Source: https://stackoverflow.com/questions/29773528/python-requests-isnt-giving-me-the-same-html-as-my-browser-is

Tags

python

browser

python-requests