Strategy for parsing LOTS and LOTS of not-so-well formed SGML / XML documents

Posted by 南楼画角 on 2019-12-02 02:23:38

The issue is that you're trying to parse SGML with an XML tool. They're not the same. If you want to use an XML tool/language to access the data, you will probably need to convert the SGML to XML before trying to parse it.

Ideally you'd either use a language/tool that supports SGML (like OmniMark) or something that can handle "XML like" data (like nokogiri from the first answer?).

The conversion can be pretty straightforward, but it gets tricky in places, especially if you're dealing with multiple doctypes (DTDs). (Also, there's no such thing as "well-formed" SGML. Yes, the elements etc. have to be nested correctly, but an SGML document always has to have a DTD.)

Here are some differences between SGML and XML that you'd need to handle. (You may not want to go this route, but it may be useful background anyway.)

  1. DOCTYPE declaration

    The DOCTYPE declaration in your example is a perfectly valid SGML doctype. The [] (internal subset) doesn't have to have anything in it. If you do have declarations in the internal subset (usually entity declarations), you're more than likely going to have to keep a doctype declaration in the XML.

    The issue the XML parser is having is that you don't have a system identifier in the declaration. In an XML doctype declaration, the system identifier is required if there is a public identifier. In an SGML doctype declaration, it's not required.

    Bottom line: unless you need the XML to validate against a DTD/Schema or you have declarations in the internal subset, strip the doctype declaration. If the XML does have to be valid, you'll at least need to add a system identifier. Don't forget to add the XML declaration (<?xml ...?>) at the top.
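
As a rough illustration of the "strip the doctype" route, here is a minimal Python sketch. The function name strip_doctype and the regular expression are made up for this example. It assumes the internal subset is empty or not needed, and that no "]" appears inside a quoted string in the subset.

```python
import re

# Matches a DOCTYPE declaration, with or without an internal subset in [...].
# Assumption: no "]" appears inside the internal subset's quoted strings.
DOCTYPE_RE = re.compile(r"<!DOCTYPE\s[^\[>]*(\[[^\]]*\])?\s*>", re.IGNORECASE)

XML_DECLARATION = '<?xml version="1.0"?>\n'


def strip_doctype(sgml_text):
    """Drop the first DOCTYPE declaration and prepend an XML declaration."""
    body = DOCTYPE_RE.sub("", sgml_text, count=1).lstrip()
    return XML_DECLARATION + body


sample = '<!DOCTYPE book PUBLIC "-//EXAMPLE//DTD Book//EN" []>\n<book></book>'
print(strip_doctype(sample))
```

If the internal subset does carry entity declarations, dropping it this way will leave undefined entity references behind, so this only fits the simple case.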

  2. Elements without end tags

    The <hardhyphen> and <hyphen> elements are valid SGML. SGML DTDs allow you to specify tag minimization, which means you can specify whether or not an end tag is required. (You can also make the start tag optional, but that's crazy talk.) In XML you have to close these elements (like <hardhyphen/> or <hardhyphen></hardhyphen>).

    The best thing to do is to look at your SGML DTD and see which elements have optional end tags. The tag minimization is specified right after the element name in the element declaration. A '-' means the tag is required. An 'o' (letter 'oh') means that the tag is optional. For example, if you see <!ELEMENT hyphen - o (#PCDATA)>, the start tag is required (-) and the end tag is optional (o). If you see <!ELEMENT hyphen - - (#PCDATA)>, both the start and the end tags are required.

    Bottom line: properly close all of the elements that don't have end tags.
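
The two steps above (find the elements with optional end tags, then close them) might look roughly like this in Python. The function names are hypothetical. The sketch assumes each element declaration names a single element, and it only self-closes elements that never have content (for example, ones declared EMPTY). Elements with omitted end tags and real content need an actual SGML parser, because where the end tag belongs depends on the content model.

```python
import re

# Matches declarations of the form <!ELEMENT name start-min end-min model>.
# Assumption: one plain element name per declaration (no name groups such as
# (a|b) and no parameter entity references).
ELEMENT_DECL_RE = re.compile(
    r"<!ELEMENT\s+(?P<name>[\w.-]+)\s+(?P<start>[-oO])\s+(?P<end>[-oO])\s+(?P<model>[^>]*)>"
)


def optional_end_tags(dtd_text):
    """Return {element name: content model} for elements whose end tag is optional."""
    return {
        m["name"]: m["model"].strip()
        for m in ELEMENT_DECL_RE.finditer(dtd_text)
        if m["end"] in "oO"
    }


def self_close(text, empty_names):
    """Rewrite <name ...> as <name .../> for elements that never have content."""
    if not empty_names:
        return text
    names = "|".join(re.escape(n) for n in sorted(empty_names))
    # (?<!/) skips tags that are already self-closed, e.g. <hyphen />.
    tag_re = re.compile(r"<(%s)(\s[^<>]*?)?(?<!/)>" % names)
    return tag_re.sub(lambda m: "<%s%s/>" % (m.group(1), m.group(2) or ""), text)


dtd = "<!ELEMENT hyphen - o EMPTY>"
empties = [n for n, model in optional_end_tags(dtd).items() if model.upper() == "EMPTY"]
print(self_close("<para>co<hyphen>operate</para>", empties))
```

Note that SGML element names are normally case-insensitive, which this sketch ignores; it matches names exactly as given.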

  3. Processing instructions

    Processing instructions (PIs) in SGML are closed with a plain > rather than the ?> that XML requires. You'll need to add the closing ?.

    Example SGML PI: <?asdf jkl>

    Example XML PI: <?asdf jkl?>
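
A small Python sketch of that fix, assuming the PIs don't sit inside comments or CDATA sections (the function name is made up for the example). PIs that already end in ?> are left unchanged, so it is safe to run twice.

```python
import re

# An SGML PI runs from "<?" to the first ">"; an XML PI must end with "?>".
# The optional "\?" lets PIs that are already XML-style pass through unchanged.
SGML_PI_RE = re.compile(r"<\?([^>]*?)\??>")


def fix_processing_instructions(text):
    """Make sure every processing instruction is closed with "?>"."""
    return SGML_PI_RE.sub(r"<?\1?>", text)


print(fix_processing_instructions("<para><?asdf jkl>text</para>"))
```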

  4. Inclusions/Exclusions

    You probably won't have to worry about this, but in an SGML DTD an element declaration can specify that another element is allowed anywhere inside that element (an inclusion) or not allowed anywhere inside it (an exclusion). This can be a pain if your target XML needs to validate against a DTD; XML DTDs do not allow inclusions/exclusions.

    This is what an inclusion might look like:

    <!ELEMENT chapter - - (section)+ +(revst|revend)>

    This is saying that revst or revend are allowed anywhere inside of chapter. If the element declaration had -(revst|revend), it would mean that revst or revend is not allowed anywhere inside of chapter.
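
If you do need a valid XML DTD, it helps to know up front which declarations use inclusions or exclusions. The following Python sketch lists them. The function name is hypothetical, and it assumes simple declarations with one element name each and no parameter entities.

```python
import re

# One element declaration: captures the name and everything after it.
ELEMENT_RE = re.compile(r"<!ELEMENT\s+([\w.-]+)\s([^>]*)>")
# An exception group: whitespace, then + or -, then a parenthesized name group.
EXCEPTION_RE = re.compile(r"\s([-+])\(([^)]*)\)")


def find_exceptions(dtd_text):
    """Return {element: {"inclusions": [...], "exclusions": [...]}} as found."""
    found = {}
    for name, rest in ELEMENT_RE.findall(dtd_text):
        for sign, group in EXCEPTION_RE.findall(rest):
            kind = "inclusions" if sign == "+" else "exclusions"
            names = [n.strip() for n in group.split("|")]
            found.setdefault(name, {}).setdefault(kind, []).extend(names)
    return found


print(find_exceptions("<!ELEMENT chapter - - (section)+ +(revst|revend)>"))
```

Each element reported here needs its content model rewritten by hand for XML, for example by adding the included elements explicitly to the content models where they actually occur.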

Hope this helps.

ma11hew28

Yeah, use Nokogiri.

Scroll down a bit on that page and copy the code under "Synopsis" into a file, say xml-parser.rb. Then, if you're on a Mac (Ruby comes preinstalled), run gem install nokogiri from Terminal, and then run the file with: ruby xml-parser.rb.

You can also type irb right from Terminal, then require 'nokogiri', and start playing around with the Nokogiri API in real time. Gotta love interactive Ruby. :)

If you're on Windows, try this Ruby installer for Windows.
