using beautifulsoup 4 for xml causes strange behaviour (memory issues?)

断了今生、忘了曾经 submitted on 2020-01-25 03:37:30

Question


I'm getting strange behaviour with this:

>>> from bs4 import BeautifulSoup

>>> smallfile = 'small.xml'      # approx. 600 bytes
>>> largerfile = 'larger.xml'    # approx. 2300 bytes
>>> len(BeautifulSoup(open(smallfile, 'r'), ['lxml', 'xml']))
1
>>> len(BeautifulSoup(open(largerfile, 'r'), ['lxml', 'xml']))
0

Contents of small.xml:

<?xml version="1.0" encoding="us-ascii"?>
<Catalog>
<CMoverMissile id="HunterSeekerMissile">
<MotionPhases index="1">
<Driver value="Guidance"/>
<Acceleration value="3200"/>
<MaxSpeed value="2.9531"/>
<Clearance value="0.5"/>
<ClearanceLookahead value="3"/>
<Outro value="-4.5,-4.25"/>
<YawPitchRoll value="MAX"/>
</MotionPhases>
<MotionPhases index="2">
<Driver value="Guidance"/>
<Acceleration value="4"/>
<MaxSpeed value="2.9531"/>
<Clearance value="0.5"/>
<ClearanceLookahead value="3"/>
<Outro value="-2.25,-2"/>
<YawPitchRoll value="MAX"/>
</MotionPhases>
</CMoverMissile>
</Catalog>

largerfile is simply the smaller file padded with spaces and newlines (between the last two closing tags, in case it's relevant); i.e., the structure and contents of the XML should be identical in both cases.

On rare occasions processing largerfile will actually yield a partial result where only a small portion of the xml has been parsed. I can't seem to reliably recreate the circumstances.

Since BeautifulSoup uses lxml, I tested to see if lxml could handle the files independently. lxml appeared to be able to parse both files.

>>> from lxml import etree
>>> tree = etree.parse(smallfile)
>>> len(etree.tostring(tree))
547
>>> tree = etree.parse(largerfile)
>>> len(etree.tostring(tree))
2294

I'm using

  • netbook with 1gb ram
  • windows 7
  • lxml 2.3 (had some trouble installing this, I hope a dodgy installation isn't causing the problem)
  • beautiful soup 4.0.1
  • python 3.2 (I also have python 2.7x installed, but have been using 3.2 for this code)

What could be preventing the larger file from being processed properly? My current suspicion is some weird memory issue, since the file size seems to make a difference, perhaps in conjunction with some bug in how BeautifulSoup 4 interacts with lxml.

Edit: to better illustrate...

>>> smallsoup = BeautifulSoup(open(smallfile, 'r'), ['lxml', 'xml'])
>>> smallsoup
<?xml version="1.0" encoding="utf-8"?>
<Catalog>
<CMoverMissile id="HunterSeekerMissile">
<MotionPhases index="1">
<Driver value="Guidance"/>
<Acceleration value="3200"/>
<MaxSpeed value="2.9531"/>
<Clearance value="0.5"/>
<ClearanceLookahead value="3"/>
<Outro value="-4.5,-4.25"/>
<YawPitchRoll value="MAX"/>
</MotionPhases>
<MotionPhases index="2">
<Driver value="Guidance"/>
<Acceleration value="4"/>
<MaxSpeed value="2.9531"/>
<Clearance value="0.5"/>
<ClearanceLookahead value="3"/>
<Outro value="-2.25,-2"/>
<YawPitchRoll value="MAX"/>
</MotionPhases>
</CMoverMissile>
</Catalog>
>>> largersoup = BeautifulSoup(open(largerfile, 'r'), ['lxml', 'xml'])
>>> largersoup
<?xml version="1.0" encoding="utf-8"?>

>>>

>>> repr(open(largerfile, 'r').read())
'\'<?xml version="1.0" encoding="us-ascii"?>\\n<Catalog>\\n<CMoverMissile id="HunterSeekerMissile">\\n<MotionPhases index="1">\\n<Driver value="Guidance"/>\\n<Acceleration value="3200"/>\\n<MaxSpeed value="2.9531"/>\\n<Clearance value="0.5"/>\\n<ClearanceLookahead value="3"/>\\n<Outro value="-4.5,-4.25"/>\\n<YawPitchRoll value="MAX"/>\\n</MotionPhases>\\n<MotionPhases index="2">\\n<Driver value="Guidance"/>\\n<Acceleration value="4"/>\\n<MaxSpeed value="2.9531"/>\\n<Clearance value="0.5"/>\\n<ClearanceLookahead value="3"/>\\n<Outro value="-2.25,-2"/>\\n<YawPitchRoll value="MAX"/>\\n</MotionPhases>\\n</CMoverMissile>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       </Catalog>\''

note: there are many spaces (which probably won't show up in the browser) between </CMoverMissile> and </Catalog>\''


Answer 1:


len(soup) returns len(soup.contents), i.e., the number of immediate children (in this case a single child, <Catalog>).

BeautifulSoup fails to parse largerfile, so len(soup) == 0.
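The len-means-immediate-children convention can be checked without BS4 at all: the standard library's ElementTree behaves the same way on elements (a minimal illustration with made-up tag names, not the asker's files):

```python
import xml.etree.ElementTree as ET

# A root element with exactly two immediate children.
doc = '<Catalog><CMoverMissile id="m"/><MotionPhases index="1"/></Catalog>'
root = ET.fromstring(doc)

# len() on an element counts immediate children, not bytes:
print(len(root))  # 2
print(len(doc))   # the character count of the raw string is much larger
```

So a len of 1 on the small file's soup means "one root element was parsed", and a len of 0 means no elements were parsed at all, regardless of file size.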




Answer 2:


It turns out the problem lies somewhere in the BS4/lxml interaction. The author of BS4 (BeautifulSoup) acknowledges the problem (https://groups.google.com/group/beautifulsoup/browse_thread/thread/24a82209aca4c083):

"Apparently BS4+lxml won't parse an XML document that's longer than about 550 bytes. I only tested it with small documents. The BS4 handler code is not even being called, which makes it hard to debug, but it's not a guarantee the problem is on the lxml side."

A slight tweak to J.F. Sebastian's helpful code sample gives the size at which the code fails:

>>> from bs4 import BeautifulSoup
>>> from itertools import count
>>> for n in count():
...     s = "<a>" + " " * n + "</a>"
...     nchildren = len(BeautifulSoup(s, 'xml'))
...     if nchildren != 1:  # broken
...         print(len(s))
...         break
...
1092

The code parses the XML as expected for strings of 1091 characters or fewer; strings of 1092 characters or more usually fail.

UPDATE: BeautifulSoup 4.0.2 has been released with a workaround:

"This new version works around what appears to be a bug in lxml's XMLParser.feed(), which was preventing BS from parsing XML documents larger than about 512-1024 characters. "
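For anyone stuck on an affected version, one workaround (assuming the goal is just a parsed tree, not BS4's API specifically) is to bypass BS4 and use a different XML parser; a sketch using the standard library, with the padded document reconstructed inline:

```python
import xml.etree.ElementTree as ET

# Reproduce the failing shape: a small document padded well past 1 kB
# with whitespace between the last two closing tags.
padded = ('<Catalog><CMoverMissile id="HunterSeekerMissile"/>'
          + ' ' * 2000 + '</Catalog>')

# The standard-library parser is unaffected by the padding:
root = ET.fromstring(padded)
print(root.tag, len(root))  # Catalog 1
```

This matches the asker's own observation that lxml's etree.parse handles both files fine when called directly; only the BS4-over-lxml path was affected.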




Answer 3:


After checking, it seems that calling len() on a BeautifulSoup object doesn't return the byte length but some other property of the tree (the number of immediate children, rather than node depth or byte count).



Source: https://stackoverflow.com/questions/9837713/using-beautifulsoup-4-for-xml-causes-strange-behaviour-memory-issues
