Python requests not getting full page

Posted by 巧了我就是萌 on 2021-02-11 16:52:08

Question


This is my code:

import requests
from bs4 import BeautifulSoup

url = "http://www.yopmail.com/en/?smith"
# Fetch the page; requests returns the raw HTML and does not run any JavaScript
response = requests.get(url)
# Parse the response body (the 'html5lib' parser needs the html5lib package installed)
soup = BeautifulSoup(response.text, 'html5lib')
print(soup)

It returns this output:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"><head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
</head>
<body onload="document.getElementById('f').submit();">
<form action="." id="f" method="post">
<input id="yp" name="yp" type="hidden" value="XAQHlAwL5ZwL1ZQZlAGH3ZGV"/>
<input id="login" name="login" type="hidden" value="smith"/>
<input id="id" name="id" type="hidden" value=""/>
</form>
<noscript><br/><br/>  <strong>Your browser does not support javascript or it may be disabled</strong></noscript>

</body></html>

I want the whole page source instead of this.


Answer 1:


This happens because requests returns the HTML exactly as the server sent it and never executes JavaScript, so you get the source as it is before any script runs. You can install requests-html and import HTMLSession from requests_html. Its supported features:

  • Full JavaScript support!
  • CSS Selectors (a.k.a jQuery-style, thanks to PyQuery).
  • XPath Selectors, for the faint of heart.
  • Mocked user-agent (like a real web browser).
  • Automatic following of redirects.
  • Connection–pooling and cookie persistence.
  • The Requests experience you know and love, with magical parsing abilities.
  • Async Support

Example:

pip install requests-html

from requests_html import HTMLSession
from requests_html import AsyncHTMLSession

url2search = "https://******"
session = HTMLSession()
r = session.get(url2search)

Then render the page's JavaScript with:

r.html.render()

Note, the first time you ever run the render() method, it will download Chromium into your home directory (e.g. ~/.pyppeteer/). This only happens once. You may also need to install a few Linux packages to get pyppeteer working.
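Putting the pieces above together, a minimal sketch might look like the following. The function name fetch_rendered_html and the timeout value are my own choices for illustration, and whether the rendered result contains everything you want depends on the target site.

```python
def fetch_rendered_html(url, timeout=20):
    """Return the page source after requests-html has run the page's JavaScript.

    Hypothetical helper for illustration; not part of requests-html itself.
    """
    # Imported inside the function so the module still loads where
    # requests-html is not installed.
    from requests_html import HTMLSession

    session = HTMLSession()
    try:
        response = session.get(url)
        # Runs the page in headless Chromium (downloaded on first use).
        response.html.render(timeout=timeout)
        # After render(), .html holds the markup produced by the scripts.
        return response.html.html
    finally:
        session.close()
```

Calling fetch_rendered_html("http://www.yopmail.com/en/?smith") and printing the result should then show the markup after the scripts have run, rather than the auto-submitting form.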

More details are in the requests-html documentation.




Answer 2:


I would have preferred to post this as a comment rather than an answer, since it is only a hint, but I don't have enough reputation to comment. So here are my two cents:

Notice the lines

<body onload="document.getElementById('f').submit();">
<form action="." id="f" method="post">

in the HTML you received. The page submits that form automatically as soon as it loads. This may be a very basic protection against the kind of scraping you are attempting, and it may be enough to switch from requests.get to requests.post, which also means changing the GET-style parameter

/?smith

in the URL to a POST parameter instead.
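As a rough sketch of that idea, the hidden fields can be read out of the first response and sent back as POST data. The names HiddenFormParser, build_post and submit_form are made up for this example, and I am assuming (without having verified it against the site) that the server accepts the yp token when it is posted from the same session.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# The form from the question's output, used here as sample input.
SAMPLE_HTML = """
<body onload="document.getElementById('f').submit();">
<form action="." id="f" method="post">
<input id="yp" name="yp" type="hidden" value="XAQHlAwL5ZwL1ZQZlAGH3ZGV"/>
<input id="login" name="login" type="hidden" value="smith"/>
<input id="id" name="id" type="hidden" value=""/>
</form>
</body>
"""


class HiddenFormParser(HTMLParser):
    """Collect the first form's action and the name/value of every <input>."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and self.action is None:
            self.action = attrs.get("action") or ""
        elif tag == "input" and attrs.get("name"):
            self.fields[attrs["name"]] = attrs.get("value") or ""


def build_post(page_url, html):
    """Return (absolute form URL, dict of form fields) for the given page."""
    parser = HiddenFormParser()
    parser.feed(html)
    return urljoin(page_url, parser.action or ""), parser.fields


def submit_form(page_url):
    """Fetch the page, then POST its form the way the onload script would."""
    import requests  # imported here so the parsing above runs without it

    # A Session keeps any cookies set by the first response (an assumption
    # that the site needs them).
    with requests.Session() as session:
        first = session.get(page_url)
        post_url, data = build_post(first.url, first.text)
        return session.post(post_url, data=data)
```

With the sample above, build_post("http://www.yopmail.com/en/?smith", SAMPLE_HTML) yields the URL without the ?smith query string plus the three fields, with login set to smith.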

However, you may just as well run into further code afterwards that requires JavaScript. In that case, see the other answer by Basu_C.
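One way to tell which case you are in is to check whether a response is still an auto-submitting stub like the one in the question. The helper below is only a heuristic of my own (the name is_auto_submit_stub is hypothetical): it looks for a body onload handler that calls submit() and will not catch other kinds of script-driven pages.

```python
from html.parser import HTMLParser


class _OnloadSubmitFinder(HTMLParser):
    """Flag a <body> whose onload attribute submits a form."""

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            onload = dict(attrs).get("onload") or ""
            if ".submit(" in onload:
                self.found = True


def is_auto_submit_stub(html):
    """Return True if the page only exists to submit a form via JavaScript."""
    finder = _OnloadSubmitFinder()
    finder.feed(html)
    return finder.found


stub = '<body onload="document.getElementById(\'f\').submit();"><form id="f"></form></body>'
normal = "<html><body><p>Inbox</p></body></html>"
```

If the response to your POST still looks like a stub by this check, plain requests is probably not enough and a JavaScript-capable approach is the next thing to try.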



Source: https://stackoverflow.com/questions/60416507/python-requests-not-getting-full-page
