bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml [duplicate]

Question

Can you please suggest a fix? The script downloads almost all of the images from imgur pages that contain a single image, but it fails in this case. I'm not sure why it isn't working or how to fix it.

elif 'imgur.com' in submission.url and not (submission.url.endswith('gif')
        or submission.url.endswith('webm')
        or submission.url.endswith('mp4')
        or 'all' in submission.url
        or '#' in submission.url
        or '/a/' in submission.url):
    html_source = requests.get(submission.url).text  # download the image's page
    soup = BeautifulSoup(html_source, "lxml")  # this is the call that raises FeatureNotFound
    image_url = soup.select('img')[0]['src']  # take the first <img> on the page
    if image_url.startswith('//'):
        image_url = 'http:' + image_url  # make protocol-relative URLs absolute
    image_id = image_url[image_url.rfind('/') + 1:image_url.rfind('.')]
    try:
        image_file = urllib2.urlopen(image_url, timeout=5)
        with open('/home/mona/computer_vision/image_retrieval/images/' + category + '/' + 'imgur_' + datetime.datetime.now().strftime('%y-%m-%d-%s') + image_url[-9:], 'wb') as output_image:
            output_image.write(image_file.read())
    except urllib2.URLError as e:
        print(e)
        continue

The error is:

[LOG] Done Getting http://i.imgur.com/FoCjtI7.jpg
submission id is: 1alffm
[LOG] Getting url:  http://sphotos-a.ak.fbcdn.net/hphotos-ak-ash4/217834_10151246341237704_484810759_n.jpg
HTTP Error 403: Forbidden
[LOG] Getting url:  http://imgur.com/xp386
Traceback (most recent call last):
  File "download_images.py", line 67, in <module>
    soup = BeautifulSoup(html_source, "lxml")
  File "/usr/lib/python2.7/dist-packages/bs4/__init__.py", line 155, in __init__
    % ",".join(features))
bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?

Answer 1:

Open a Python shell and try the following:

from bs4 import BeautifulSoup
myHTML = "<html><head></head><body><strong>Hi</strong></body></html>"
soup = BeautifulSoup(myHTML, "lxml")

Does that work, or do you get the same error? If you get the same error, lxml is not installed for the Python you are running. Install it:

pip install lxml

I'm walking through these steps because you say the script runs for a good while before crashing, which would suggest that the parser can't be missing.
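
If installing lxml is not an option, one possible workaround is to fall back to Python's built-in "html.parser" whenever lxml cannot be imported. The following is only a minimal sketch: choose_parser is a hypothetical helper name, not part of bs4, and html.parser may build a slightly different tree than lxml on malformed pages.

```python
def choose_parser(preferred="lxml", fallback="html.parser"):
    """Return the parser name to hand to BeautifulSoup.

    Hypothetical helper: it returns `preferred` when the lxml package can be
    imported, and the standard-library `fallback` parser otherwise.
    """
    try:
        # Importing etree is enough to tell whether lxml is usable here.
        from lxml import etree  # noqa: F401
    except ImportError:
        return fallback
    return preferred


if __name__ == "__main__":
    # Shows which parser name would be used on this machine.
    print(choose_parser())
```

In the script from the question, the call would then read soup = BeautifulSoup(html_source, choose_parser()).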

Added by OP:

If you are using Python 2.7 on Ubuntu/Debian, this worked for me:

$ sudo apt-get build-dep python-lxml
$ sudo pip install lxml 

Test it like this:

mona@pascal:~/computer_vision/image_retrieval$ python
Python 2.7.6 (default, Jun 22 2015, 17:58:13) 
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import lxml
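
If the import succeeds in the shell but the script still raises FeatureNotFound, one possible cause is that pip installed lxml for a different interpreter than the one running the script. The sketch below prints which interpreter is active and whether it can see a given module. It assumes Python 3.4 or newer (importlib.util.find_spec does not exist on Python 2.7), and module_status is a hypothetical helper name.

```python
import importlib.util
import sys


def module_status(name):
    """Report whether the running interpreter can locate the module `name`."""
    spec = importlib.util.find_spec(name)  # None when the module is not found
    return {
        "interpreter": sys.executable,            # path of the running Python
        "module": name,
        "found": spec is not None,
        "location": getattr(spec, "origin", None),  # file the module loads from
    }


if __name__ == "__main__":
    print(module_status("lxml"))
```

Run it with the same command used to start download_images.py, so that the interpreter reported is the one that matters.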

Source: https://stackoverflow.com/questions/39986835/bs4-featurenotfound-couldnt-find-a-tree-builder-with-the-features-you-requeste

Tags

python

web-scraping

beautifulsoup

lxml

bs4