Passing lxml output to BeautifulSoup

核能气质少年 提交于 2019-12-24 13:33:15

问题


My offline code works fine but I'm having trouble passing a web page from urllib via lxml to BeautifulSoup. I'm using urllib for basic authentication then lxml to parse (it gives a good result with the specific pages we need to scrape) then to BeautifulSoup.

#! /usr/bin/python
import urllib.request 
import urllib.error 
from io import StringIO
from bs4 import BeautifulSoup 
from lxml import etree 
from lxml import html 

file = open("sample.html")
doc = file.read()
parser = etree.HTMLParser()
html = etree.parse(StringIO(doc), parser)
result = etree.tostring(html.getroot(), pretty_print=True, method="html")
soup = BeautifulSoup(result)
# working perfectly

With that working, I tried to feed it a page via urllib:

# attempt 1
page = urllib.request.urlopen(req)
doc = page.read()
# print (doc)
parser = etree.HTMLParser()
html = etree.parse(StringIO(doc), parser)
# TypeError: initial_value must be str or None, not bytes

Trying to deal with the error message, I tried:

# attempt 2
html = etree.parse(bytes.decode(doc), parser)
#OSError: Error reading file

I didn't know what to do about the OSError so I sought another method. I found suggestions to use lxml.html instead of lxml.etree so the next attempt is:

attempt 3
page = urllib.request.urlopen(req)
doc = page.read()
# print (doc)
html = html.document_fromstring(doc)
print (html)
# <Element html at 0x140c7e0>
soup = BeautifulSoup(html) # also tried (html, "lxml")
# TypeError: expected string or buffer

This clearly gives a structure of some sort, but how to pass it to BeautifulSoup? My question is twofold: How can I pass a page from urllib to lxml.etree (as in attampt 1, closest to my working code)? or, How can I pass a lxml.html structure to BeautifulSoup (as above)? I understand that both revolve around datatypes but don't know what to do about them.

python 3.3, lxml 3.0.1, BeautifulSoup 4. I'm new to python. Thanks to the internet for code fragments and examples.


回答1:


BeautifulSoup can use the lxml parser directly, no need to go to these lengths.

BeautifulSoup(doc, 'lxml')


来源:https://stackoverflow.com/questions/13830303/passing-lxml-output-to-beautifulsoup

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!