Python urllib2 + BeautifulSoup

Question

I'm struggling to integrate BeautifulSoup into my current Python project. To keep this plain and simple, I'll reduce the complexity of my current script.

Script without BeautifulSoup:

import urllib2

def check(self, name, proxy):
    # Send all HTTP requests through the given proxy
    urllib2.install_opener(
        urllib2.build_opener(
            urllib2.ProxyHandler({'http': 'http://%s' % proxy}),
            urllib2.HTTPHandler()
        )
    )

    # Passing a data argument makes this a POST request
    req = urllib2.Request('http://example.com', "param=1")
    try:
        resp = urllib2.urlopen(req)
    except:
        self.insert()
    if 'example text' in resp.read():
        print 'success'

This is just a sketch of what I have going on. In simple terms, I'm sending a POST request to example.com, and if the response from resp.read() contains "example text", I print success.

But what I actually want is to check

if ' example ' in resp.read()

and then output the text inside the right-aligned td from the example.com response, using

soup.find_all('td', {'align':'right'})[4]

The way I'm implementing BeautifulSoup isn't working. For example:

import urllib2
from bs4 import BeautifulSoup as soup

main_div = soup.find_all('td', {'align':'right'})[4]

def check(self, name, proxy):
    # Send all HTTP requests through the given proxy
    urllib2.install_opener(
        urllib2.build_opener(
            urllib2.ProxyHandler({'http': 'http://%s' % proxy}),
            urllib2.HTTPHandler()
        )
    )

    # Passing a data argument makes this a POST request
    req = urllib2.Request('http://example.com', "param=1")
    try:
        resp = urllib2.urlopen(req)
        web_soup = soup(urllib2.urlopen(req), 'html.parser')
    except:
        self.insert()
    if 'example text' in resp.read():
        print 'success' + main_div

As you can see, I added four new lines/adjustments:

from bs4 import BeautifulSoup as soup

web_soup = soup(urllib2.urlopen(url), 'html.parser')

main_div = soup.find_all('td', {'align':'right'})[4]

as well as " + main_div " in the print statement.

However, it doesn't work. While adjusting the code I've run into a few errors, including "local variable referenced before assignment" and "unbound method find_all must be called with BeautifulSoup instance as first argument".
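
The "local variable referenced before assignment" error is consistent with the try/except layout in the script: if urlopen raises, resp is never bound, yet execution continues to resp.read(). Below is a minimal sketch of that pattern with no network access. The names fetch, check_broken and check_fixed are hypothetical stand-ins, not part of the original script, and returning early from the except branch is only one possible way to handle the failure.

```python
def fetch(fail):
    """Hypothetical stand-in for urllib2.urlopen(req).read()."""
    if fail:
        raise IOError('connection refused')
    return 'some example text'


def check_broken(fail):
    try:
        body = fetch(fail)
    except IOError:
        pass  # error is swallowed, but execution continues below
    return 'example text' in body  # body is unbound if fetch() raised


def check_fixed(fail):
    try:
        body = fetch(fail)
    except IOError:
        return False  # stop here: there is no response to inspect
    return 'example text' in body


try:
    check_broken(fail=True)
except UnboundLocalError as exc:
    print('broken version raised: %s' % exc)

print(check_fixed(fail=True))
print(check_fixed(fail=False))
```

In the real script, the except branch would call self.insert() and then return, so the code that inspects the response only runs when a response exists.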

Answer 1:

Regarding your last code snippet:

from bs4 import BeautifulSoup as soup

web_soup = soup(urllib2.urlopen(url), 'html.parser')
main_div = soup.find_all('td', {'align':'right'})[4]

You should call find_all on the web_soup instance. Also be sure to define the url variable before you use it:

import urllib2
from bs4 import BeautifulSoup as soup

url = "url to be opened"
# Parse the response into a BeautifulSoup instance, then search that instance
web_soup = soup(urllib2.urlopen(url), 'html.parser')
main_div = web_soup.find_all('td', {'align':'right'})[4]
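
To tie this back to the check in the question, here is a rough sketch that parses the body once, calls find_all on the instance, and extracts the cell text as a string. It is written against an inline HTML string instead of a live urllib2 request so it can run without a network or proxy. The HTML sample and the function name fifth_right_cell are assumptions made for illustration. In the real script the html argument would be the result of a single resp.read() call, since a response body can only be read once.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the body returned by resp.read()
HTML = """
<table>
  <tr><td align="right">a</td><td align="right">b</td><td align="left">skip</td></tr>
  <tr><td align="right">c</td><td align="right">d</td><td align="right">e</td></tr>
</table>
"""


def fifth_right_cell(html, marker):
    """Return the text of the fifth right-aligned td, or None.

    None is returned when the marker text is absent or when the page
    has fewer than five matching cells.
    """
    if marker not in html:
        return None
    # Build an instance first; find_all is called on the instance, not the class
    web_soup = BeautifulSoup(html, 'html.parser')
    cells = web_soup.find_all('td', {'align': 'right'})
    if len(cells) < 5:
        return None
    # get_text() yields a string, which can safely be joined to 'success'
    return cells[4].get_text(strip=True)


result = fifth_right_cell(HTML, 'skip')
if result is not None:
    print('success ' + result)
```

Note that find_all returns Tag objects, so concatenating one directly to 'success' would raise a TypeError; calling get_text() first avoids that. Checking the length before indexing with [4] also avoids an IndexError on pages with fewer cells than expected.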

Source: https://stackoverflow.com/questions/45629540/python-urllib2-beautifulsoup

Tags

python

beautifulsoup

urllib2