Parsing broken XML with lxml.etree.iterparse

眼角桃花 2020-12-01 07:01

I'm trying to parse a huge XML file with lxml in a memory-efficient manner (i.e. streaming lazily from disk instead of loading the whole file into memory). Unfortunately, the

3 answers
  •  我在风中等你
    2020-12-01 07:29

    Edit your question, stating what happens (exact error message and traceback (copy/paste, don't type from memory)) to make you think that "bad unicode" is the problem.

    Get chardet and feed it your MySQL dump. Tell us what it says.

    Show us the first 200 to 300 bytes of your dump, using e.g. print repr(dump[:300])

    Update You wrote """As you can see, chardet thinks it is an ascii file, but there is a "\x1e" right in the middle of this example which is making lxml raise an exception."""

    I see no "bad unicode" here.

    chardet is correct. What makes you think that "\x1e" is not ASCII? It is an ASCII character, a C0 control character named "RECORD SEPARATOR".

    The error message says that you have an invalid character. That is also correct. The only control characters that are valid in XML are "\t", "\r" and "\n". MySQL should be grumbling about that and/or offering you a way of escaping it e.g. _x001e_ (yuk!)
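
    To make the "invalid character" point concrete, here is a minimal sketch (written for Python 3, standard library only) that tests characters against the Char production of XML 1.0. The helper name is_xml10_char is mine, not part of lxml or any other library.

```python
def is_xml10_char(ch):
    """Return True if the single character ch matches the XML 1.0 Char production."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)          # tab, line feed, carriage return
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )

# The fragment from the dump: plain ASCII, but with a RECORD SEPARATOR in it.
sample = "t2\x1et3"
bad = [(i, hex(ord(c))) for i, c in enumerate(sample) if not is_xml10_char(c)]
print(bad)  # [(2, '0x1e')]
```

    Running something like this over the dump is one way to list which characters are actually in the way before deciding what to do about them.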

    Given the context, it looks like that character could be deleted with no loss. You may wish to fix your database, or you may wish to remove such characters from your dump (after checking that they are all vanishable), or you may wish to choose a less picky and less voluminous output format than XML.
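
    If you go the "remove them from the dump" route, a rough sketch of a chunked filter follows (Python 3, standard library only). It assumes the dump is in an ASCII-compatible encoding such as ASCII or UTF-8, where these byte values can only ever mean the control characters themselves; it would not be safe on UTF-16. The function name and the chunk size are arbitrary choices of mine.

```python
import io

# Bytes 0x00-0x1F minus tab, LF and CR: the C0 controls that XML 1.0 forbids.
FORBIDDEN = bytes(b for b in range(0x20) if b not in (0x09, 0x0A, 0x0D))

def strip_forbidden_controls(src, dst, chunk_size=64 * 1024):
    """Copy src to dst chunk by chunk, dropping forbidden control bytes.

    Returns the number of bytes removed so the result can be sanity-checked.
    """
    removed = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        cleaned = chunk.translate(None, FORBIDDEN)
        removed += len(chunk) - len(cleaned)
        dst.write(cleaned)
    return removed

# In-memory stand-ins for the real dump and the cleaned copy.
src = io.BytesIO(b"<p>t2\x1et3</p>\n")
dst = io.BytesIO()
removed = strip_forbidden_controls(src, dst)
print(removed, dst.getvalue())  # 1 b'<p>t2t3</p>\n'
```

    With real files you would open both in binary mode ('rb' and 'wb'). Only one chunk is in memory at a time, so the cleaned copy can then be parsed incrementally.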

    Update 2 You presumably want to use iterparse() not because it's your end goal but because you want to save memory. If you used a format like CSV you wouldn't have a memory problem.
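
    For comparison, a small sketch of reading a CSV dump one record at a time with the standard csv module (Python 3). The column names id and description are made up for the illustration; a real dump will have its own.

```python
import csv
import io

def iter_rows(fileobj):
    """Yield one dict per CSV record; only the current record is held in memory."""
    for row in csv.DictReader(fileobj):
        yield row

# Hypothetical two-column dump standing in for a real file on disk.
dump = io.StringIO('id,description\r\n1,"<p>t1</p>"\r\n2,"<p>t2\x1et3</p>"\r\n')
ids = [row["id"] for row in iter_rows(dump)]
print(ids)  # ['1', '2']
```

    With a real file, open it with newline='' as the csv documentation recommends. Note that the reader does not object to the "\x1e" in the second record.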

    Update 3 In response to a comment by @Purrell:

    try it yourself, dude. pastie.org/3280965

    Here's the contents of that pastie; it deserves preservation:

    from lxml.etree import etree
    
    data = '\t<p>The cafeteria rang with excited voices.  Our barbershop quartet, The Bell \r Tones was asked to perform at the local Home for the Blind in the next town.  We, of course, were glad to entertain such a worthy group and immediately agreed .  One wag joked, "Which uniform should we wear?"  followed with, "Oh, that\'s right, they\'ll never notice."  The others didn\'t respond to this, in fact, one said that we should wear the nicest outfit we had.</p><p>A small stage was set up for us and a pretty decent P.A. system was donated for the occasion.  The audience was made up of blind persons of every age, from the thirties to the nineties.  Some sported sighted companions or nurses who stood or sat by their side, sharing the moment equally.  I observed several German shepherds lying at their feet, adoration showing in their eyes as they wondered what was going on.  After a short introduction in which we identified ourselves, stating our voice part and a little about our livelihood, we began our program.  Some songs were completely familiar and others, called "Oh, yeah" songs, only the chorus came to mind.  We didn\'t mind at all that some sang along \x1e they enjoyed it so much.</p><p>In fact, a popular part of our program is when the audience gets to sing some of the old favorites.  The harmony parts were quite evident as they tried their voices to the different parts.  I think there was more group singing in the old days than there is now, but to blind people, sound and music is more important.   We received a big hand at the finale and were made to promise to return the following year.  Everyone was treated to coffee and cake, our quartet going around to the different circles of friends to sing a favorite song up close and personal.  As we approached a new group, one blind lady amazed me by turning to me saying, "You\'re the baritone, aren\'t you?"  
Previously no one had ever been able to tell which singer sang which part but this lady was listening with her whole heart.</p><p>Retired portrait photographer.  Main hobby - quartet singing.</p>\n'
    
    magical_parser = etree.XMLParser(encoding='utf-8', recover=True)
    tree = etree.parse(StringIO(data), magical_parser)
    

    To get it to run, one import needs to be fixed, and another supplied. The data is monstrous. There is no output to show the result. Here's a replacement with the data cut down to the bare essentials. The 5 pieces of ASCII text (excluding < and >) that are all valid XML characters are replaced by t1, ..., t5. The offending \x1e is flanked by t2 and t3.

    [output wraps at column 80]
    Python 2.7.2 (default, Jun 12 2011, 15:08:59) [MSC v.1500 32 bit (Intel)] on win
    32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> from lxml import etree
    >>> from cStringIO import StringIO
    >>> data = '
    <p>t1</p><p>t2\x1et3</p><p>t4 </p><p>t5</p>
    '
    >>> magical_parser = etree.XMLParser(encoding='utf-8', recover=True)
    >>> tree = etree.parse(StringIO(data), magical_parser)
    >>> print(repr(tree.getroot().text))
    '

    t1

    t2t3/ppt4/ppt5/p'

    Not what I'd call "recovery"; after the bad character, the < and > characters disappear.

    The pastie was in response to my question "What gives you the idea that encoding='utf-8' will solve his problem?". This was triggered by the statement 'There is however an "encoding" option which would have fixed your issue.' But encoding=ascii produces the same output. So does omitting the encoding arg. It's NOT an encoding problem. Case closed.
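
    A quick way to convince yourself of that without lxml (Python 3 sketch): the bytes decode without complaint, and to the same text, under both encodings, so no choice of encoding can make the control character go away.

```python
raw = b"t2\x1et3"

as_ascii = raw.decode("ascii")   # no UnicodeDecodeError: 0x1E is valid ASCII
as_utf8 = raw.decode("utf-8")    # and valid UTF-8, decoding to the same text

print(as_ascii == as_utf8, hex(ord(as_ascii[2])))  # True 0x1e
```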
