Question
I have this webpage. When I try to get its HTML using the requests module like this:
import requests
link = "https://www.worldmarktheclub.com/resorts/7m/"
f = requests.get(link)
print(f.text)
I get a result like this:
<!DOCTYPE html>
<html><head>
<meta http-equiv="Pragma" content="no-cache"/>
<meta http-equiv="Expires" content="-1"/>
<meta http-equiv="CacheControl" content="no-cache"/>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
<link rel="shortcut icon" href="data:;base64,iVBORw0KGgo="/>
<script>
(function(){
var securemsg;
var dosl7_common;
// seemingly garbage like [Z.li]+Z._j+Z.LO+Z.SJ+"(/.{"+Z.i+","+Z.Ii+"}
</script>
<script type="text/javascript" src="/TSPD/08e841a5c5ab20007f02433a700e2faba779c2e847ad5d441605ef3d4bbde75cd229bcdb30078f66?type=9"></script>
<noscript>Please enable JavaScript to view the page content.</noscript>
</head><body>
</body></html>
Only part of the result is shown. I can see the proper HTML when I inspect the webpage in a browser. I guess there might be an issue with the encoding of the page, but I can't figure it out. Using urllib.request + read() gives the same wrong result. How do I correct this? Thanks in advance.
As suggested by @DeepSpace, the garbage in the script is due to the minified JS code. But why am I not getting the HTML correctly?
Answer 1:
What you deem "garbage" is obfuscated/minified JS code that is written inline in <script> tags instead of in an external JS file.
If you look at the bottom of f.text, you will see <noscript>Please enable JavaScript to view the page content.</noscript>.
requests is not a browser, so it cannot execute the JS code that this page relies on, and the server will not allow user agents that do not support JS to access it. Setting the User-Agent header to Chrome's (Chrome/60.0.3112.90) still does not work.
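To recognise this situation in a script, one option is to check the response body for the <noscript> notice before trying to parse it. The helper below is only a rough sketch: the name needs_javascript and the "enable javascript" phrase match are my own assumptions based on the output shown in the question, and other sites will word their notice differently.

```python
import re

# Matches the contents of a <noscript> element, ignoring case and newlines.
NOSCRIPT_RE = re.compile(r"<noscript[^>]*>(.*?)</noscript>", re.IGNORECASE | re.DOTALL)


def needs_javascript(html):
    """Return True if the HTML looks like a JavaScript-only placeholder page.

    Heuristic (an assumption, not a general rule): the page contains a
    <noscript> element whose text asks the visitor to enable JavaScript.
    """
    match = NOSCRIPT_RE.search(html)
    if match is None:
        return False
    return "enable javascript" in match.group(1).lower()
```

With the code from the question, needs_javascript(f.text) would flag the response as a placeholder rather than the real page.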
You will have to resort to other tools that allow JS execution, such as selenium.
Answer 2:
The HTML code is produced on the fly by the JavaScript code you see. Unfortunately, as said by @DeepSpace, requests does not execute JavaScript.
As an alternative, I suggest using selenium. It is a library that automates a browser, so the page's JavaScript gets executed.
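A minimal sketch of the selenium approach might look like the following. It assumes selenium and a Chrome installation are available; the function name fetch_rendered_html, the 10-second timeout, and the "wait until <body> has at least one element" condition are my own choices, picked because the placeholder page in the question has an empty body. I have not verified that this particular site lets an automated (headless) browser through.

```python
def fetch_rendered_html(url, wait_seconds=10):
    """Load url in headless Chrome and return the HTML after JS has run.

    Hypothetical helper: selenium is imported lazily so the function can be
    defined even on machines where selenium is not installed.
    """
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless")  # run without opening a window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Assumption: the real page is loaded once <body> contains any element,
        # since the placeholder page has an empty <body>.
        WebDriverWait(driver, wait_seconds).until(
            lambda d: d.find_elements(By.CSS_SELECTOR, "body *")
        )
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even on timeout
```

Calling fetch_rendered_html("https://www.worldmarktheclub.com/resorts/7m/") would then take the place of requests.get(link).text; if the wait condition never becomes true, WebDriverWait raises a TimeoutException.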
Source: https://stackoverflow.com/questions/53704406/seemingly-garbage-result-with-requests