Question
This sample Python program:
document='''<p>This is <i>something</i>, it happens
in <b>real</b> life</p>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(document)
print(soup.prettify())
produces the following output:
<html>
<body>
<p>
This is
<i>
something
</i>
, it happens
in
<b>
real
</b>
life
</p>
</body>
</html>
That's wrong, because it adds whitespace before and after each opening and closing tag. For example, there should be no space between </i> and the comma. I would like it to:
Not add whitespace where there is none (even around block-level tags this could be problematic, if they are styled with display:inline in CSS).
Collapse all whitespace into a single space, except optionally for line wrapping.
Something like this:
<html>
<body>
<p>This is
<i>something</i>,
it happens in
<b>real</b> life</p>
</body>
</html>
Is this possible with BeautifulSoup? Any other recommended HTML parser that can deal with this?
Answer 1:
Beautiful Soup's .prettify() method is defined as outputting each tag on its own line (http://www.crummy.com/software/BeautifulSoup/bs4/doc/index.html#pretty-printing). If you want something else you'll need to make it yourself by walking the parse tree.
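As a rough illustration of what walking the parse tree could look like, here is a minimal sketch. The names pieces and wrap and the width parameter are made up for this example and are not part of Beautiful Soup. It assumes the html.parser backend, silently drops comments and doctypes, and does not special-case <pre>, <script> or <style>, so treat it as a starting point rather than a finished serializer.

```python
import html
import re
from bs4 import BeautifulSoup, NavigableString, Tag

def pieces(node):
    """Yield markup fragments in document order; None marks a whitespace run."""
    for child in node.children:
        if isinstance(child, Tag):
            # Rebuild the start tag by hand; list-valued attributes (e.g. class) are joined.
            attrs = "".join(
                ' {}="{}"'.format(k, html.escape(" ".join(v) if isinstance(v, list) else v))
                for k, v in child.attrs.items()
            )
            if child.is_empty_element:
                yield "<{}{}/>".format(child.name, attrs)
                continue
            yield "<{}{}>".format(child.name, attrs)
            yield from pieces(child)
            yield "</{}>".format(child.name)
        elif type(child) is NavigableString:
            # Split plain text on whitespace runs, keeping one marker per run.
            for part in re.split(r"(\s+)", child):
                if part:
                    yield None if part.isspace() else html.escape(part, quote=False)

def wrap(soup, width=30):
    """Serialize the tree, breaking lines only where the source had whitespace."""
    words, word = [], ""
    for piece in pieces(soup):
        if piece is None:
            if word:
                words.append(word)
                word = ""
        else:
            word += piece  # tags stick to the neighbouring text, no space is added
    if word:
        words.append(word)
    lines, line = [], ""
    for w in words:
        if line and len(line) + 1 + len(w) > width:
            lines.append(line)
            line = w
        else:
            line = line + " " + w if line else w
    if line:
        lines.append(line)
    return "\n".join(lines)

document = '''<p>This is <i>something</i>, it happens
in <b>real</b> life</p>'''
soup = BeautifulSoup(document, "html.parser")
wrapped = wrap(soup, width=30)
print(wrapped)
```

Because tags are glued to whatever text touches them, </i> and the comma stay together, and a line break can only appear where the original document already had whitespace.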
Answer 2:
Because .prettify() puts each tag on its own line, it is not suitable for production code; IMO it is only usable for debugging output. Just convert your soup to a string, using the str builtin function.
What you want is a change of the string contents in your tree; you could create a function to find all elements which contain sequences of two or more whitespace characters (using a pre-compiled regular expression), and then replace their contents.
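A minimal sketch of that idea follows. The helper name collapse_whitespace is just for illustration, and I'm assuming the html.parser backend and that the contents of <pre>, <script>, <style> and <textarea> should be left alone.

```python
import re
from bs4 import BeautifulSoup, NavigableString

# Pre-compiled pattern matching any run of whitespace (spaces, tabs, newlines).
WHITESPACE_RUN = re.compile(r"\s+")

def collapse_whitespace(soup):
    """Replace each whitespace run in plain text nodes with a single space."""
    # Copy the result into a list first, because replace_with() mutates the tree.
    for node in list(soup.find_all(string=True)):
        # Skip comments, doctypes and other special string types.
        if type(node) is not NavigableString:
            continue
        # Leave whitespace-sensitive elements untouched.
        if node.find_parent(["pre", "script", "style", "textarea"]) is not None:
            continue
        collapsed = WHITESPACE_RUN.sub(" ", node)
        if collapsed != node:
            node.replace_with(NavigableString(collapsed))
    return soup

document = '''<p>This is <i>something</i>, it happens
in <b>real</b> life</p>'''
soup = BeautifulSoup(document, "html.parser")
result = str(collapse_whitespace(soup))
print(result)
```

With html.parser no <html> or <body> wrapper is added, so for the sample document this should print the paragraph on one line with the newline replaced by a single space and nothing inserted around the tags.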
BTW, you can have Python avoid the insertion of insignificant whitespace if you write your example like so:
document = ('<p>This is <i>something</i>, it happens '
'in <b>real</b> life</p>')
This way you have two literals which are implicitly concatenated.
Answer 3:
As previous comments and thebjorn stated, BeautifulSoup's definition of pretty HTML is each tag on its own line. However, to deal with some of your problems, such as the line break ending up inside the text after the comma, you can collapse the whitespace first, like so:
from bs4 import BeautifulSoup
document = """<p>This is <i>something</i>, it happens
in <b>real</b> life</p>"""
document_stripped = " ".join(l.strip() for l in document.split("\n"))
soup = BeautifulSoup(document_stripped).prettify()
print(soup)
Which outputs this:
<html>
<body>
<p>
This is
<i>
something
</i>
, it happens in
<b>
real
</b>
life
</p>
</body>
</html>
Source: https://stackoverflow.com/questions/25514378/beautifulsoup-do-not-add-spaces-where-they-matter-remove-them-where-they-dont