Clicking links with Python BeautifulSoup

Submitted by 六眼飞鱼酱① on 2021-01-27 17:47:44

Question


So I'm new to Python (I come from a PHP/JavaScript background), but I just wanted to write a quick script that crawls a website and all of its child pages to find all a tags with href attributes, count how many there are, and then click each link. I can count all of the links, but I can't figure out how to "click" the links and then return the response codes.

from bs4 import BeautifulSoup
import urllib2
import re

def getLinks(url):
    html_page = urllib2.urlopen(url)
    soup = BeautifulSoup(html_page, "html.parser")
    links = []

    for link in soup.findAll('a', attrs={'href': re.compile("^http://")}):
        links.append(link.get('href'))
    return links

anchors = getLinks("http://madisonmemorial.org/")
# Click on links and return responses
countMe = len(anchors)
for anchor in anchors:
    i = getLinks(anchor)
    countMe += len(i)
    # Click on links and return responses

print countMe

Is this even possible with BeautifulSoup?
Also, I'm not looking for exact code; all I'm really after is a pointer in the right direction, e.g. which function calls to use. Thanks!


Answer 1:


urlopen is a better solution for your purpose, but if you need to click and interact with elements on the web I suggest using Selenium WebDriver. There are implementations for Java, Python, and other languages; I've used it with Java and Python and it works pretty well. You can run it headless so the browser doesn't actually open.

pip install selenium
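
A minimal sketch of that approach, assuming Selenium 4 with Chrome and a matching chromedriver on your PATH (the URL is the one from the question):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument("--headless")          # no visible browser window
driver = webdriver.Chrome(options=options)  # assumes chromedriver is on PATH

driver.get("http://madisonmemorial.org/")
links = driver.find_elements(By.TAG_NAME, "a")
print(len(links))

links[0].click()           # actually "clicks" the link like a user would
print(driver.current_url)  # the page the click navigated to
driver.quit()

One caveat: WebDriver drives a real browser, so it executes JavaScript, but it does not expose HTTP response codes; to check for 404s you still need an HTTP library such as urllib or requests.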



Answer 2:


BeautifulSoup is merely a DOM/HTML parser; it is not a real (or, in your case, emulated) browser. If you need one, you can drive a real browser such as Chrome through Selenium and crawl freely, which gives you the advantage of handling JavaScript. When that isn't needed, you can use the widely available requests package to recursively crawl all links:

import requests

for link in links:
    resp = requests.get(link)  # resp.status_code holds the HTTP response code
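
A minimal sketch of that recursive crawl, assuming requests and BeautifulSoup are installed; the visited set keeps pages that link to each other from recursing forever, and timeouts and error handling are left out for brevity:

import requests
from bs4 import BeautifulSoup

def crawl(url, visited=None):
    # Remember visited URLs so mutually linked pages don't loop forever
    if visited is None:
        visited = set()
    if url in visited:
        return visited
    visited.add(url)
    resp = requests.get(url)
    print(url, resp.status_code)  # the response code the question asks for
    soup = BeautifulSoup(resp.text, "html.parser")
    for a in soup.find_all("a", href=True):
        if a["href"].startswith("http://"):
            crawl(a["href"], visited)
    return visited

crawl("http://madisonmemorial.org/")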



Answer 3:


So with help from the comments, I decided to just use urlopen like this:

from bs4 import BeautifulSoup
import urllib.request
import urllib.error
import re

def getLinks(url):
    html_page = urllib.request.urlopen(url)
    soup = BeautifulSoup(html_page, "html.parser")
    links = []

    for link in soup.findAll('a', attrs={'href': re.compile("^http://")}):
        links.append(link.get('href'))
    return links

def getStatus(url):
    # urlopen raises HTTPError for 4xx/5xx pages, so catch it to recover
    # the status code instead of crashing; getcode() returns an int, so
    # compare against 404, not the string "404"
    try:
        return urllib.request.urlopen(url).getcode()
    except urllib.error.HTTPError as e:
        return e.code

anchors = getLinks("http://madisonmemorial.org/")
for anchor in anchors:
    if getStatus(anchor) == 404:
        pass  # Do stuff
# Click on links and return responses
countMe = len(anchors)
for anchor in anchors:
    children = getLinks(anchor)
    countMe += len(children)
    for child in children:  # urlopen takes a single URL, not a list
        if getStatus(child) == 404:
            pass  # Do some stuff

print(countMe)

I've got my own logic in place of the pass placeholders in the if statements.



Source: https://stackoverflow.com/questions/45701424/clicking-links-with-python-beautifulsoup
