How to get Python bs4 to work properly on XML?

问题

I'm trying to use Python and BeautifulSoup 4 (bs4) to convert Inkscape SVGs into an XML-like format for some proprietary software. I can't seem to get bs4 to correctly parse a minimal example. I need the parser to respect self-closing tags, handle unicode, and not add html stuff. I thought specifying the 'lxml' parser with selfClosingTags should do it, but nope! check it out.

#!/usr/bin/python
from __future__ import print_function
from bs4 import BeautifulSoup

print('\nbs4 mangled XML:')
print(BeautifulSoup('<x><c name="b1"><d value="a"/></c></x>',
    features = "lxml", 
    selfClosingTags = ('d')).prettify())

print('''\nExpected output:
<x>
 <c name="b1">
  <d value="a"/>
 </c>
</x>''')

This prints

bs4 mangled XML:
/usr/local/lib/python2.7/dist-packages/beautifulsoup4-4.4.1-py2.7.egg/bs4/__init__.py:112: UserWarning: BS4 does not respect the selfClosingTags argument to the BeautifulSoup constructor. The tree builder is responsible for understanding self-closing tags.
<html>
 <body>
  <x>
   <c name="b1">
    <d value="a">
    </d>
   </c>
  </x>
 </body>
</html>

Expected output:
<x>
 <c name="b1">
  <d value="a"/>
 </c>
</x>

I've reviewed related StackOverflow questions, and I am not finding the solution.

This question addresses the html boilerplate, but only for parsing subsections of html, not for parsing xml.

This question pertains to getting beautifulsoup 4 to respect self-closing tags, and has no accepted answers.

This question seems to indicate that passing the selfClosingTags argument should help, but as you can see this now generates a warning BS4 does not respect the selfClosingTags argument, and self-closing tags are mangled.

This question suggests that using "xml" (not "lxml") will cause empty tags to be automatically self-closing. This might work for my purposes, but applying the "xml" parser to my actual data fails because the files contain unicode, which the "xml" parser does not support.

Is "xml" different from "lxml", and is it in the standard that "xml" cannot support unicode, and "lxml" cannot contain self-closing tags? Perhaps I'm simply trying to do something that is forbidden?

回答1:

If you want the result to be output as xml then parse it as that. Your xml data can contain unicode, however, you need to declare the encoding:

#!/usr/bin/env python
# -*- encoding: utf8 -*-

The SelfClosingTags is no longer recognized. Instead, Beautiful Soup considers any empty tag to be an empty-element tag. If you add a child to an empty-element tag, it stops being an empty-element tag.

Changing your function to look like this should work (In addition to the encoding):

print(BeautifulSoup('<x><c name="b1"><d value="a®"/></c></x>',
    features = "xml").prettify())

Result:

<?xml version="1.0" encoding="utf-8"?>
<x>
 <c name="b1">
  <d value="aÂŽ"/>
 </c>
</x>

来源：https://stackoverflow.com/questions/36144192/how-to-get-python-bs4-to-work-properly-on-xml

标签

python

xml

unicode

beautifulsoup

bs4