lxml / BeautifulSoup parser warning

Question

Using Python 3, I'm trying to parse ugly HTML (which is not under my control) by using lxml with BeautifulSoup as explained here: http://lxml.de/elementsoup.html

Specifically, I want to work with lxml's API, but have BeautifulSoup do the parsing because, as I said, the HTML is ugly and lxml on its own will reject it.

The link above says: "All you need to do is pass it to the fromstring() function:"

from lxml.html.soupparser import fromstring
root = fromstring(tag_soup)

So that's what I'm doing:

import requests
URL = 'http://some-place-on-the-internet.com'
html_goo = requests.get(URL).text
root = fromstring(html_goo)

It works in the sense that I can manipulate the HTML just fine after that. My problem is that every time I run the script, I receive this annoying warning:

/usr/lib/python3/dist-packages/bs4/__init__.py:166: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

To get rid of this warning, change this:

 BeautifulSoup([your markup])

to this:

 BeautifulSoup([your markup], "html.parser")

  markup_type=markup_type))

My problem is perhaps obvious: I'm not instantiating BeautifulSoup myself. I've tried adding the proposed parameter to the fromstring function, but that just gives me the error: TypeError: 'str' object is not callable. Searches online have proven fruitless so far.

I'd like to get rid of that warning message. Help appreciated, thanks in advance.

Answer 1:

For others who instantiate BeautifulSoup directly, like this:

soup = BeautifulSoup(html_doc)

Use

soup = BeautifulSoup(html_doc, 'html.parser')

instead
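
As a quick check of the above, here is a minimal sketch that records warnings while building the soup. The sample markup and the variable names are made up for illustration, and it assumes bs4 is installed.

```python
import warnings

from bs4 import BeautifulSoup

# Hypothetical sample markup with unclosed tags.
html_doc = "<p>Unclosed paragraph<b>bold"

# Record every warning raised while the soup is being built.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    soup = BeautifulSoup(html_doc, "html.parser")

# Keep only the "no parser specified" warning from the question, if any.
parser_warnings = [
    w for w in caught
    if "No parser was explicitly specified" in str(w.message)
]
print(len(parser_warnings))
print(soup.b.get_text())
```

Dropping the second argument should bring the warning back into the recorded list.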

Answer 2:

I had to read lxml's and BeautifulSoup's source code to figure this out.

I'm posting my own answer here, in case someone else may need it in the future.

The fromstring function in question is defined so:

def fromstring(data, beautifulsoup=None, makeelement=None, **bsargs):

The **bsargs arguments end up being forwarded to the BeautifulSoup constructor, which is called like so (in another function, _parse):

tree = beautifulsoup(source, **bsargs)

The BeautifulSoup constructor is defined so:

def __init__(self, markup="", features=None, builder=None,
             parse_only=None, from_encoding=None, exclude_encodings=None,
             **kwargs):

Now, back to the warning in the question, which recommends adding the argument "html.parser" to BeautifulSoup's constructor. According to the signature above, that is the second positional parameter, named features.

Since the fromstring function will pass on named arguments to BeautifulSoup's constructor, we can specify the parser by naming the argument to the fromstring function, like so:

root = fromstring(html_goo, features='html.parser')

Poof. The warning disappears.
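
To make the forwarding concrete, here is a small standard-library sketch. FakeSoup and fake_fromstring are hypothetical stand-ins that only mimic the signatures quoted above; they are not lxml's or BeautifulSoup's real code. The sketch also suggests why passing "html.parser" positionally, as tried in the question, likely produces the TypeError: the string binds to the beautifulsoup parameter and then gets called.

```python
# Simplified stand-ins for illustration only (hypothetical names).

class FakeSoup:
    """Stand-in for BeautifulSoup: remembers which parser was requested."""

    def __init__(self, markup="", features=None, **kwargs):
        self.markup = markup
        self.features = features


def fake_fromstring(data, beautifulsoup=None, makeelement=None, **bsargs):
    """Same parameter layout as fromstring(); extra keywords are forwarded."""
    if beautifulsoup is None:
        beautifulsoup = FakeSoup
    return beautifulsoup(data, **bsargs)


# Keyword argument: collected in **bsargs and passed on to the constructor.
tree = fake_fromstring("<p>hi", features="html.parser")
print(tree.features)  # html.parser

# Positional argument: binds to `beautifulsoup`, so the string itself is called.
try:
    fake_fromstring("<p>hi", "html.parser")
except TypeError as exc:
    error_message = str(exc)
    print(error_message)  # 'str' object is not callable
```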

Answer 3:

When using BeautifulSoup, we usually write something like this:

[variable] = BeautifulSoup([contents you want to analyze])

Here is the problem:

If lxml is already installed, BeautifulSoup automatically picks it as the parser and tells you that it did so. This is not an error, just a notification.

So how do you remove it?

Name the parser explicitly, like this:

[variable] = BeautifulSoup([contents you want to analyze], features="lxml")

(Based on BeautifulSoup 4.6.3, the latest version at the time of writing.)

Note that different versions of BeautifulSoup may expect this argument in a different form, so read the warning message carefully and follow what it suggests.

Good luck!
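
One caveat with features="lxml" is that it only works where lxml is installed. A minimal sketch of a fallback, assuming bs4 is installed; make_soup is a hypothetical helper name, not part of either library:

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Prefer the lxml parser, fall back to the built-in html.parser."""
    try:
        return BeautifulSoup(markup, features="lxml")
    except FeatureNotFound:
        # bs4 raises FeatureNotFound when the requested parser is unavailable.
        return BeautifulSoup(markup, features="html.parser")


soup = make_soup("<p>hi</p>")
print(soup.p.get_text())
```

Keep in mind that the two parsers can build different trees from the same broken markup, which is exactly what the warning is about, so a fallback like this trades consistency for portability.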

Source: https://stackoverflow.com/questions/50045775/lxml-beautifulsoup-parser-warning

Tags

python

python-3.x

beautifulsoup

lxml