BeautifulSoup: do not add spaces where they matter, remove them where they don't

Submitted by 吃可爱长大的小学妹 on 2020-01-11 06:43:08

Question


This sample python program:

document='''<p>This is <i>something</i>, it happens
               in <b>real</b> life</p>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(document)
print(soup.prettify())

produces the following output:

<html>
 <body>
  <p>
   This is
   <i>
    something
   </i>
   , it happens
               in
   <b>
    real
   </b>
   life
  </p>
 </body>
</html>

That's wrong, because it adds whitespace before and after each opening and closing tag; for example, there should be no space between </i> and the comma that follows it. I would like it to:

  1. Not add whitespace where there is none (even around block-level tags, where added whitespace could be problematic if the tags are styled with display:inline in CSS).

  2. Collapse all whitespace into a single space, except optionally for line wrapping.

Something like this:

<html>
 <body>
  <p>This is
   <i>something</i>,
   it happens in
   <b>real</b> life</p>
 </body>
</html>

Is this possible with BeautifulSoup? Any other recommended HTML parser that can deal with this?


Answer 1:


Beautiful Soup's .prettify() method is defined as outputting each tag on its own line (http://www.crummy.com/software/BeautifulSoup/bs4/doc/index.html#pretty-printing). If you want something else you'll need to make it yourself by walking the parse tree.
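
As a rough illustration of walking the parse tree yourself, here is a minimal sketch, not a complete serializer. The names render, open_tag and CONTAINER_TAGS are made up for this example, and the set of tags that get one child per line is an arbitrary assumption. Everything outside those tags is written on a single line with whitespace runs collapsed, which would also alter <pre> content and attribute values, so treat it as a starting point only.

```python
import html
import re

from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical choice: only these tags get their children on separate lines.
CONTAINER_TAGS = {"html", "head", "body", "div", "ul", "ol", "table"}


def open_tag(tag):
    """Rebuild a start tag, keeping its attributes."""
    parts = [tag.name]
    for name, value in tag.attrs.items():
        if isinstance(value, list):  # multi-valued attributes such as class
            value = " ".join(value)
        parts.append(f'{name}="{html.escape(value, quote=True)}"')
    return "<" + " ".join(parts) + ">"


def render(node, depth=0):
    """Indent container tags; keep all other content on one line."""
    indent = " " * depth
    if isinstance(node, Tag) and node.name in CONTAINER_TAGS:
        lines = [indent + open_tag(node)]
        for child in node.children:
            if not isinstance(child, Tag) and not child.strip():
                continue  # skip whitespace-only text between tags
            lines.append(render(child, depth + 1))
        lines.append(f"{indent}</{node.name}>")
        return "\n".join(lines)
    # Tags serialize via str(); strings and comments via output_ready(),
    # which applies entity escaping and comment delimiters.
    markup = str(node) if isinstance(node, Tag) else node.output_ready()
    return indent + re.sub(r"\s+", " ", markup).strip()


document = '''<html><body><p>This is <i>something</i>, it happens
               in <b>real</b> life</p></body></html>'''
soup = BeautifulSoup(document, "html.parser")
print(render(soup.html))
```

With the sample document this prints <html> and <body> on their own indented lines and the whole <p> element on one line, with no space inserted between </i> and the comma.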




Answer 2:


Because .prettify() puts each tag on its own line, it is not suitable for production code; it is only usable for debugging output, IMO. Just convert your soup to a string, using the str builtin function.

What you want is a change of the string contents in your tree; you could create a function to find all elements which contain sequences of two or more whitespace characters (using a pre-compiled regular expression), and then replace their contents.
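
The function described above might look something like the following sketch. The names collapse_whitespace, WHITESPACE_RUN and SKIP_PARENTS are invented for this example, and the list of tags to leave untouched is an assumption. It uses the standard-library-backed "html.parser", which does not wrap the fragment in <html> and <body>.

```python
import re

from bs4 import BeautifulSoup, NavigableString

# Pre-compiled pattern matching any run of whitespace.
WHITESPACE_RUN = re.compile(r"\s+")

# Assumption: whitespace inside these tags is significant and is left alone.
SKIP_PARENTS = {"pre", "script", "style", "textarea"}


def collapse_whitespace(soup):
    """Replace each whitespace run in plain text nodes with one space."""
    for node in soup.find_all(string=True):
        # Only touch plain text; comments, doctypes and CDATA are subclasses.
        if type(node) is not NavigableString:
            continue
        if node.parent is not None and node.parent.name in SKIP_PARENTS:
            continue
        collapsed = WHITESPACE_RUN.sub(" ", node)
        if collapsed != node:
            node.replace_with(collapsed)
    return soup


document = '''<p>This is <i>something</i>, it happens
               in <b>real</b> life</p>'''
soup = collapse_whitespace(BeautifulSoup(document, "html.parser"))
print(str(soup))
# <p>This is <i>something</i>, it happens in <b>real</b> life</p>
```

Because the result is produced with str rather than .prettify(), no whitespace is added around tags; only the whitespace already present in text nodes is normalized.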

BTW, you can have Python avoid the insertion of insignificant whitespace if you write your example like so:

document = ('<p>This is <i>something</i>, it happens '
            'in <b>real</b> life</p>')

This way you have two literals which are implicitly concatenated.
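
A quick check of the difference, using only the standard library. The variable names joined and multiline are just for this illustration.

```python
# Adjacent string literals inside parentheses are joined by the compiler,
# so the line break in the source code never becomes part of the string.
joined = ('<p>This is <i>something</i>, it happens '
          'in <b>real</b> life</p>')

# A triple-quoted literal keeps the newline and the indentation that follows.
multiline = '''<p>This is <i>something</i>, it happens
               in <b>real</b> life</p>'''

print("\n" in joined)     # False
print("\n" in multiline)  # True
```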




Answer 3:


As previous comments and thebjorn stated, BeautifulSoup's definition of pretty HTML puts each tag on its own line. However, to deal with some of your problems with the spacing of , and such, you can collapse the whitespace in the document first, like so:

from bs4 import BeautifulSoup

document = """<p>This is <i>something</i>, it happens
               in <b>real</b> life</p>"""

document_stripped = " ".join(l.strip() for l in document.split("\n"))

soup = BeautifulSoup(document_stripped).prettify()

print(soup)

Which outputs this:

<html>
 <body>
  <p>
   This is
   <i>
    something
   </i>
   , it happens in
   <b>
    real
   </b>
   life
  </p>
 </body>
</html>
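
One caveat with the line-by-line join above: blank lines leave double spaces behind, and tabs or repeated spaces inside a line are not touched. A possible variation, shown here as a sketch with the made-up helper names collapse_lines and collapse_all, splits on any whitespace instead. Both helpers work on the raw markup before parsing, so they would also change whitespace inside <pre> blocks and attribute values.

```python
def collapse_lines(markup):
    """The approach from the answer: strip each line, join with spaces."""
    return " ".join(line.strip() for line in markup.split("\n"))


def collapse_all(markup):
    """Variation: str.split() with no argument splits on any whitespace run."""
    return " ".join(markup.split())


# Sample with a blank line and a tab to show where the two differ.
document = "<p>This is <i>something</i>, it happens\n\n   in <b>real</b>\tlife</p>"

print(repr(collapse_lines(document)))
# '<p>This is <i>something</i>, it happens  in <b>real</b>\tlife</p>'
print(repr(collapse_all(document)))
# '<p>This is <i>something</i>, it happens in <b>real</b> life</p>'
```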


Source: https://stackoverflow.com/questions/25514378/beautifulsoup-do-not-add-spaces-where-they-matter-remove-them-where-they-dont
