How can I extract the list of URLs fetched during an HTML page render in Python?

谎友^ 2021-01-22 01:59

I want to get the list of all URLs that a browser issues GET requests for when it opens a page. For example, if we try to open cnn.com, there are multiple URLs wit

2 answers

暗喜 (original poster)

2021-01-22 03:02
It's likely that you'll have to render the page (not necessarily display it, though) to be sure you're getting a complete list of all resources. I've used PyQt and QtWebKit in similar situations. Once you start counting resources that JavaScript includes dynamically, parsing and loading pages recursively with BeautifulSoup just isn't going to work.
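To make the limitation concrete, here is a rough sketch of what static parsing can see. It uses the standard library's html.parser as a stand-in for BeautifulSoup, and the names StaticResourceParser and static_resource_urls, the tag list, and the example.com URLs are all made up for illustration. It only finds URLs written directly in the markup; anything a script requests at runtime never shows up.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class StaticResourceParser(HTMLParser):
    """Collects URLs referenced by src/href attributes in static markup."""

    # Hypothetical, incomplete map of tags to the attribute that usually
    # triggers a GET when the page loads.
    RESOURCE_ATTRS = {"img": "src", "script": "src", "iframe": "src", "link": "href"}

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        wanted = self.RESOURCE_ATTRS.get(tag)
        for name, value in attrs:
            if name == wanted and value:
                # Resolve relative references against the page URL.
                self.urls.append(urljoin(self.base_url, value))


def static_resource_urls(html, base_url):
    """Return the resource URLs declared directly in the HTML text."""
    parser = StaticResourceParser(base_url)
    parser.feed(html)
    return parser.urls


sample = (
    '<html><head><link href="/s.css" rel="stylesheet">'
    '<script src="a.js"></script></head>'
    '<body><img src="http://cdn.example.com/i.png">'
    '<script>var i = new Image(); i.src = "/dyn.png";</script>'
    "</body></html>"
)
for url in static_resource_urls(sample, "http://example.com/page/"):
    print(url)
```

The image that the inline script loads (/dyn.png) is missing from the result, which is exactly the kind of request you only catch by rendering the page.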

Ghost.py is an excellent client to get you started with PyQt. Also, check out the QWebView docs and the QNetworkAccessManager docs.

Ghost.py returns a tuple of (page, resources) when opening a page:
```
from ghost import Ghost

ghost = Ghost()
# open() renders the page and returns it together with every resource it loaded
page, resources = ghost.open('http://my.web.page')
```
resources includes all of the resources loaded by the original URL as HttpResource objects. You can retrieve the URL for a loaded resource with resource.url.
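Turning that into a plain list of URLs could look something like the sketch below. The helper name resource_urls is my own, and the SimpleNamespace objects are only stand-ins so the snippet runs without Ghost.py installed; the one thing it relies on is that each resource has a url attribute, as described above.

```python
from types import SimpleNamespace


def resource_urls(resources):
    """Return each resource's URL once, in the order it was first loaded."""
    seen = set()
    urls = []
    for resource in resources:
        url = resource.url
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


# Stand-ins for the HttpResource objects; only the .url attribute is used.
fake_resources = [
    SimpleNamespace(url="http://my.web.page/"),
    SimpleNamespace(url="http://my.web.page/style.css"),
    SimpleNamespace(url="http://my.web.page/app.js"),
    SimpleNamespace(url="http://my.web.page/style.css"),  # fetched twice
]
for url in resource_urls(fake_resources):
    print(url)
```

With a real Ghost.py session you would pass the resources value from the tuple instead of fake_resources.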