Parse SGML with Open Arbitrary Tags in Python 3

猫巷女王i 2020-11-30 09:20

I am trying to parse a file such as: http://www.sec.gov/Archives/edgar/data/1409896/000118143112051484/0001181431-12-051484.hdr.sgml

I am using Python 3 and have bee

1 answer
  • 2020-11-30 09:38

    If you can find an SGML DTD for the documents that you work with, a solution could be to use the osx SGML to XML converter from the OpenSP SGML toolkit to turn the documents into XML.

    Here is a simple example. Let's say that we have the following SGML document (company.sgml; with a root element):

    <!DOCTYPE ROOT SYSTEM "company.dtd">
    <ROOT>
    <COMPANY>Awesome Corp
    <FORM> 24-7
    <ADDRESS>
    <STREET>101 PARSNIP LN
    <ZIP>31337
    </ADDRESS>
    

    The DTD (company.dtd) looks like this:

    <!ELEMENT ROOT       -  o (COMPANY, FORM, ADDRESS) >
    <!ELEMENT COMPANY    -  o (#PCDATA) >
    <!ELEMENT FORM       -  o (#PCDATA) >
    <!ELEMENT ADDRESS    -  - (STREET, ZIP) >
    <!ELEMENT STREET     -  o (#PCDATA) >
    <!ELEMENT ZIP        -  o (#PCDATA) >
    

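    The "- o" pair is the tag-omission indicator: the first character applies to the start tag and the second to the end tag. "-" means the tag is required and "o" means it may be omitted, so "- o" declares an element whose end tag is optional. ADDRESS is declared "- -", which is why it needs an explicit </ADDRESS>.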
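To see which elements of a DTD allow an omitted end tag, the declarations can be scanned from Python. This is only a rough sketch: the regular expression and the name end_tag_optional are illustrative assumptions, and it handles only simple declarations of the form shown above (no name groups, no parameter entities).

```python
import re

DTD_TEXT = """\
<!ELEMENT ROOT       -  o (COMPANY, FORM, ADDRESS) >
<!ELEMENT COMPANY    -  o (#PCDATA) >
<!ELEMENT FORM       -  o (#PCDATA) >
<!ELEMENT ADDRESS    -  - (STREET, ZIP) >
<!ELEMENT STREET     -  o (#PCDATA) >
<!ELEMENT ZIP        -  o (#PCDATA) >
"""

# Element name, then the start-tag flag, then the end-tag flag.
ELEMENT_DECL = re.compile(r"<!ELEMENT\s+(\S+)\s+([-oO])\s+([-oO])\s")


def end_tag_optional(dtd_text):
    """Map each declared element name to True if its end tag may be omitted."""
    return {
        name: end_flag.lower() == "o"
        for name, _start_flag, end_flag in ELEMENT_DECL.findall(dtd_text)
    }


flags = end_tag_optional(DTD_TEXT)
```

With the DTD above, only ADDRESS maps to False; every other element may leave its end tag out.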

    The SGML document can be parsed with osx, and the output can be formatted with xmllint, as follows:

    osx company.sgml | xmllint --format -
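Since the question is about Python 3, the conversion step can also be driven from a script. The following is a minimal sketch, assuming osx is installed and on PATH; the function name sgml_to_xml and its converter parameter are hypothetical. The exit status is deliberately not treated as fatal, on the assumption that a converter may report markup problems and still emit usable XML.

```python
import subprocess
import sys


def sgml_to_xml(sgml_path, converter=("osx",)):
    """Run an SGML-to-XML converter on sgml_path and return its stdout as text."""
    result = subprocess.run(
        [*converter, sgml_path],
        capture_output=True,
        text=True,
    )
    # Pass the converter's diagnostics through so they are not silently lost.
    if result.stderr:
        print(result.stderr, file=sys.stderr, end="")
    if not result.stdout.strip():
        raise RuntimeError(f"converter produced no output for {sgml_path}")
    return result.stdout
```

Calling sgml_to_xml("company.sgml") would then return the XML as a string, ready to hand to an XML parser.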
    

    Output from the above command:

    <?xml version="1.0"?>
    <ROOT>
      <COMPANY>Awesome Corp</COMPANY>
      <FORM> 24-7</FORM>
      <ADDRESS>
        <STREET>101 PARSNIP LN</STREET>
        <ZIP>31337</ZIP>
      </ADDRESS>
    </ROOT>
    

    Now we have well-formed XML that can be processed with lxml or other XML tools.
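As an illustration of that last step, here is a small sketch that reads the converted document into a dictionary. It uses the standard library's xml.etree.ElementTree rather than lxml so that it runs without extra packages; lxml.etree provides fromstring and findtext as well. The function name company_record and the dictionary keys are made up for this example.

```python
import xml.etree.ElementTree as ET

# The XML produced by the osx | xmllint pipeline above.
XML_TEXT = """<?xml version="1.0"?>
<ROOT>
  <COMPANY>Awesome Corp</COMPANY>
  <FORM> 24-7</FORM>
  <ADDRESS>
    <STREET>101 PARSNIP LN</STREET>
    <ZIP>31337</ZIP>
  </ADDRESS>
</ROOT>
"""


def company_record(xml_text):
    """Extract the fields of the example document into a flat dictionary."""
    root = ET.fromstring(xml_text)

    def text_of(path):
        # findtext returns the default when the element is missing.
        return root.findtext(path, default="").strip()

    return {
        "company": text_of("COMPANY"),
        "form": text_of("FORM"),
        "street": text_of("ADDRESS/STREET"),
        "zip": text_of("ADDRESS/ZIP"),
    }


record = company_record(XML_TEXT)
```

The strip() call removes the leading space that the SGML source carried in the FORM value.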

    I don't know if there is a complete DTD for the document that you link to. The following PDF file contains related information about EDGAR, including a DTD that might be useful: http://www.sec.gov/info/edgar/pdsdissemspec910.pdf (I found it via this answer). But the linked SGML document contains elements (SEC-HEADER, for example) that are not mentioned in the PDF file.
