Adding missing XML closing tags in JavaScript

Submitted by 旧巷老猫 on 2019-11-28 12:29:06

Question


I need to parse external files with the below structure using Node.js.

<ISSUER>
<COMPANY-DATA>
<CONFORMED-NAME>EXACTECH INC
<CIK>000012345
<ASSIGNED-SIC>9999
<IRS-NUMBER>8979898988
<STATE-OF-INCORPORATION>FL
<FISCAL-YEAR-END>1231
</COMPANY-DATA>
<BUSINESS-ADDRESS>
<STREET1>22W 56TH COURT
<CITY>GAINSVILLE
<STATE>FL
<ZIP>32653
<PHONE>999-999-9999
</BUSINESS-ADDRESS>
<MAIL-ADDRESS>
<STREET1>22W 56TH COURT
<CITY>GAINSVILLE
<STATE>FL
<ZIP>32653
</MAIL-ADDRESS>
</ISSUER>

The blocks have closing tags, but the individual lines do not. How can I add the missing closing tags so that I can parse the XML?

I do not have control over how the XML files are generated, so I cannot get this fixed at the source.

This is similar to this Java implementation: Parsing XML with no closing tags in Java


Answer 1:


Your data looks like SGML, the superset of XML that allows tag inference and omission. I'm working on an SGML parser for JavaScript (for the browser, Node.js and other CommonJS platforms), but it isn't released yet. For the time being, I suggest using the venerable OpenSP software. It doesn't have an npm integration package, but you can easily install it on e.g. Ubuntu/Debian using sudo apt-get install opensp, and similarly on other Linux distributions and on Mac OS via MacPorts.

The OpenSP package contains the osx command-line utility, which down-converts SGML to XML. You can use Node's child_process core module to invoke the osx program, pipe your SGML data to it, capture the XML it produces, and then feed that XML to the XML parser of your choice in your Node app.
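A rough sketch of that child_process step might look like the following. The helper name sgmlToXml is made up for illustration, and the sketch assumes that osx is on your PATH and that it reads from standard input when no file argument is given, so check both against your installation.

```javascript
const { spawnSync } = require('child_process');

// Pipe SGML text into a converter command and return whatever it writes to
// stdout. `command` defaults to OpenSP's `osx`; it is a parameter so that a
// different binary name or path can be substituted.
// Assumption: osx reads its input from stdin when no file argument is given.
function sgmlToXml(sgml, command = 'osx', args = []) {
  const result = spawnSync(command, args, { input: sgml, encoding: 'utf8' });
  if (result.error) {
    // For example ENOENT when the program is not installed or not on PATH.
    throw result.error;
  }
  if (result.status !== 0) {
    // Treat any non-zero exit status as a failure and surface stderr.
    throw new Error(`${command} exited with status ${result.status}: ${result.stderr}`);
  }
  return result.stdout;
}

module.exports = { sgmlToXml };
```

The returned string can then be handed to whichever XML parser you already use. Throwing on every non-zero exit status is a deliberately strict choice; you may prefer to inspect stderr and keep the output when the converter only reports recoverable problems.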

SGML and the osx program must be told to add the omitted end-element tags for CONFORMED-NAME, CIK, and the other elements whose end tags are left out. You do that by prepending a document type declaration (containing the DTD) to your SGML content. In your case, what you supply to the osx program should look as follows:

<!DOCTYPE ISSUER [
  <!ELEMENT ISSUER - -
     (COMPANY-DATA,BUSINESS-ADDRESS,MAIL-ADDRESS)>
  <!ELEMENT COMPANY-DATA - -
     (CONFORMED-NAME,CIK,ASSIGNED-SIC,IRS-NUMBER,
       STATE-OF-INCORPORATION,FISCAL-YEAR-END)>
  <!ELEMENT (BUSINESS-ADDRESS,MAIL-ADDRESS) - -
     (STREET1,CITY,STATE,ZIP,PHONE?)>
  <!ELEMENT
     (CONFORMED-NAME,CIK,ASSIGNED-SIC,IRS-NUMBER,
       STATE-OF-INCORPORATION,FISCAL-YEAR-END,
       STREET1,CITY,STATE,ZIP,PHONE) - O (#PCDATA)>
]>
<ISSUER> ... rest of your input data following here
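
In Node, prepending the declaration is plain string concatenation. The sketch below is only one way to do it: the names ISSUER_DOCTYPE and withDoctype are hypothetical, and PHONE is declared as an optional child here because the sample BUSINESS-ADDRESS block in the question contains a PHONE line.

```javascript
// Document type declaration for the ISSUER records. PHONE is listed as an
// optional child (PHONE?) because the sample BUSINESS-ADDRESS block has one
// and the MAIL-ADDRESS block does not.
const ISSUER_DOCTYPE = `<!DOCTYPE ISSUER [
  <!ELEMENT ISSUER - -
     (COMPANY-DATA,BUSINESS-ADDRESS,MAIL-ADDRESS)>
  <!ELEMENT COMPANY-DATA - -
     (CONFORMED-NAME,CIK,ASSIGNED-SIC,IRS-NUMBER,
       STATE-OF-INCORPORATION,FISCAL-YEAR-END)>
  <!ELEMENT (BUSINESS-ADDRESS,MAIL-ADDRESS) - -
     (STREET1,CITY,STATE,ZIP,PHONE?)>
  <!ELEMENT
     (CONFORMED-NAME,CIK,ASSIGNED-SIC,IRS-NUMBER,
       STATE-OF-INCORPORATION,FISCAL-YEAR-END,
       STREET1,CITY,STATE,ZIP,PHONE) - O (#PCDATA)>
]>
`;

// Return the SGML text with the declaration in front of it. Input that
// already starts with a DOCTYPE is returned unchanged.
function withDoctype(sgml) {
  return /^\s*<!DOCTYPE/i.test(sgml) ? sgml : ISSUER_DOCTYPE + sgml;
}

module.exports = { ISSUER_DOCTYPE, withDoctype };
```

The result of withDoctype(fileContents) is what you would pipe to the osx program.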

Crucially, the declarations for CONFORMED-NAME, CIK, and the other field-like elements use - O (hyphen-minus and the letter O) as tag omission indicators. The hyphen means the start tag is required, and the O tells SGML that the end-element tags for these elements can be omitted; the osx program then inserts them automatically.
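
To make the effect concrete, here is a small plain-JavaScript illustration of what ends up happening for this particular one-field-per-line layout. It is not SGML parsing and not what osx does internally: it is a line-based shortcut with a made-up helper name (closeFieldTags), and it only holds up while every field sits on its own line as in the sample.

```javascript
// Append an end tag to every line of the form `<NAME>text`.
// Lines that consist of a tag only (`<ISSUER>`, `</COMPANY-DATA>`) have no
// text after the tag and are left unchanged, as are lines already closed.
function closeFieldTags(text) {
  return text
    .split(/\r?\n/)
    .map((line) => {
      // A start tag at the beginning of the line, followed by some text.
      const match = /^\s*<([A-Za-z][\w.-]*)>.*\S/.exec(line);
      if (!match) return line;
      const endTag = `</${match[1]}>`;
      const trimmed = line.trimEnd();
      return trimmed.endsWith(endTag) ? line : trimmed + endTag;
    })
    .join('\n');
}

module.exports = { closeFieldTags };
```

For example, the line `<CIK>000012345` becomes `<CIK>000012345</CIK>`, while `<COMPANY-DATA>` and `</COMPANY-DATA>` pass through untouched. Field text containing `&` or `<` would still need escaping before an XML parser accepts it.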

You can read more about the meaning of these declarations on my project page at http://sgmljs.net/docs/sgmlrefman.html .



Source: https://stackoverflow.com/questions/50450793/adding-missing-xml-closing-tags-in-javascript
