Python to parse non-standard XML file

野趣味 2021-01-05 11:24

My input file is actually multiple XML files appended into one file (it's from Google Patents). It has the structure below:



        
3 Answers
  •  臣服心动
    2021-01-05 12:02

    I'd opt for parsing each chunk of XML separately.

    You seem to already be doing that in your sample code. Here's my take on your code:

    def parse_xml_buffer(buffer):
        dom = minidom.parseString("".join(buffer))  # join list into string of XML
        # .... parse dom ...
    
    buffer = [file.readline()]  # initialise with the first line
    for line in file:
        if line.startswith("

    Once you've broken the file down into individual XML blocks, how you actually parse each one depends on your requirements and, to some extent, your preference. Options include lxml, minidom, ElementTree, expat, and BeautifulSoup.
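As a rough illustration of the chunk-by-chunk approach, here is a minimal sketch using only the standard library. It assumes every embedded document starts on its own line with an XML declaration (`<?xml ...`); that marker, the function name `split_xml_documents`, and the sample data are all assumptions made for the example and are not taken from the original file.

```python
import io
from xml.dom import minidom


def split_xml_documents(lines, marker="<?xml "):
    """Yield each embedded XML document as a single string.

    Assumption: every document begins on a new line with an XML
    declaration, so a line starting with `marker` opens a new document.
    """
    buffer = []
    for line in lines:
        # A new declaration means the previous document is complete.
        if line.startswith(marker) and buffer:
            yield "".join(buffer)
            buffer = []
        buffer.append(line)  # appending to a list avoids repeated string copies
    if buffer:
        yield "".join(buffer)  # the final document has no declaration after it


# Made-up sample standing in for the real concatenated file.
sample = io.StringIO(
    '<?xml version="1.0"?>\n<doc><id>1</id></doc>\n'
    '<?xml version="1.0"?>\n<doc><id>2</id></doc>\n'
)

chunks = list(split_xml_documents(sample))
# Parse each chunk on its own; here we only read the root element's name.
roots = [minidom.parseString(chunk).documentElement.tagName for chunk in chunks]
print(len(chunks), roots)
```

With a real file you would pass the open file object instead of the `io.StringIO` sample, since iterating over either yields lines. Writing it as a generator keeps only one document in memory at a time, which matters for large dumps.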


    Update:

    Starting from scratch, here's how I would do it (using BeautifulSoup):

    #!/usr/bin/env python
    from bs4 import BeautifulSoup  # BeautifulSoup 4; "from BeautifulSoup import ..." is BeautifulSoup 3 (Python 2 only)
    
    def separated_xml(infile):
        file = open(infile, "r")
        buffer = [file.readline()]
        for line in file:
            if line.startswith("

    This returns:

    D0629996
    29316765
    D471343
    D475175
    6715152
    D498899
    D558952
    D571528
    D577177
    D584027
    .... (lots more)...
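The listing above looks like one identifier per matching element. If you would rather not depend on BeautifulSoup, a similar extraction can be sketched with the standard library's `xml.etree.ElementTree`. The element name `doc-number`, the helper `iter_tag_text`, and the sample documents below are assumptions for illustration only; substitute whatever element your data actually uses.

```python
import xml.etree.ElementTree as ET


def iter_tag_text(xml_strings, tag):
    """Yield the stripped text of every <tag> element in each XML string."""
    for xml_string in xml_strings:
        root = ET.fromstring(xml_string)
        # Element.iter walks the whole tree, so nesting depth does not matter.
        for element in root.iter(tag):
            if element.text:
                yield element.text.strip()


# Made-up documents; "doc-number" is an assumed element name.
documents = [
    "<patent><doc-number>D0000001</doc-number></patent>",
    "<patent><ref><doc-number>0000002</doc-number></ref></patent>",
]

numbers = list(iter_tag_text(documents, "doc-number"))
for number in numbers:
    print(number)
```

`iter_tag_text` accepts any iterable of XML strings, so it can consume the output of a chunk-splitting generator directly without building an intermediate list.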
    
