sgml

How to parse an OFX (Version 1.0.2) file in PHP?

流过昼夜 submitted on 2021-02-07 12:26:22
Question: I have an OFX file downloaded from Citibank. The file has a DTD defined at http://www.ofx.net/DownloadPage/Files/ofx102spec.zip (file OFXBANK.DTD), and the OFX file appears to be valid SGML. I'm trying to parse it with DomDocument in PHP 5.4.13, but I get several warnings and the file is not parsed. My code is: $file = "source/ACCT_013.OFX"; $dtd = "source/ofx102spec/OFXBANK.DTD"; $doc = new DomDocument(); $doc->loadHTMLFile($file); $doc->schemaValidate($dtd); $dom->validateOnParse = true; The OFX file starts as:

Is &gt; ever necessary?

旧巷老猫 submitted on 2020-01-14 07:18:48
Question: I have been developing websites and XML interfaces for 7 years and have never been in a situation where it was really necessary to use &gt; for a > . So far, all disambiguation could be handled by quoting < , & , " and ' alone. Has anyone ever been in a situation (related to, e.g., SGML processing, browser issues, XSLT, ...) where you found it indispensable to escape the greater-than sign with &gt; ? Update: I just checked with the XML spec, where it says, for example, about character data in

Strategy for parsing LOTS and LOTS of not-so-well formed SGML / XML documents

我只是一个虾纸丫 submitted on 2019-12-31 03:24:06
Question: I have thousands of SGML documents, some well-formed, some not so well-formed. I need to get at certain elements in the documents, but every time I try to load and read them into an XDocument, an XMLDocument, or even just a StreamReader, I get various XMLException errors, such as "'[' is an unexpected token.". Why? Because I have a document with a DOCTYPE like <!DOCTYPE RChapter PUBLIC "-//LSC//DTD R Chapter for Authoring//EN" [] > and I have learned that the "[]" needs to have

Which ASCII characters are forbidden for use in SGML attributes?

点点圈 submitted on 2019-12-23 22:06:44
Question: Apart from whitespace, quotation mark, equal sign, and tab, which other characters of the printable subset of ASCII are forbidden in attribute names in SGML? Answer 1: By default, SGML allows only alphanumeric values for SGML names. Which additional characters are allowed in SGML names is controlled by the SGML declaration, specifically by UCNMCHAR and LCNMCHAR under NAMING . For example, if you look at the SGML declaration for HTML 4, you'll see: LCNMCHAR ".-_:" UCNMCHAR ".-_:" This means
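
The rule described in the answer above can be sketched in Python. This is a rough illustration, not a conforming SGML name check: it assumes that a name must start with an ASCII letter, that digits are allowed after the first character, and that the extra name characters are exactly the `".-_:"` string quoted from the HTML 4 SGML declaration. The helper `make_name_checker` is hypothetical, and limits such as NAMELEN are ignored.

```python
import string


def make_name_checker(extra_name_chars=""):
    """Build a predicate that tests whether a string is an acceptable name.

    Assumption: names start with an ASCII letter and continue with
    letters, digits, or any character listed in extra_name_chars
    (the role LCNMCHAR/UCNMCHAR play in the SGML declaration).
    """
    start_chars = set(string.ascii_letters)
    rest_chars = start_chars | set(string.digits) | set(extra_name_chars)

    def is_name(candidate):
        if not candidate or candidate[0] not in start_chars:
            return False
        return all(ch in rest_chars for ch in candidate[1:])

    return is_name


# Extra name characters as quoted from the HTML 4 SGML declaration.
is_html4_name = make_name_checker(".-_:")

for name in ["http-equiv", "xml:lang", "data@x", "-leading"]:
    print(name, is_html4_name(name))
```

Passing a different string to `make_name_checker` models a different SGML declaration, which is the point the answer makes: the permitted set is configurable rather than fixed.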

get contents of <a> tags using python

橙三吉。 submitted on 2019-12-21 19:57:52
Question: Assuming I have HTML read into my program like this: <p><a href="http://vancouver.en.craigslist.ca/nvn/ret/1817849271.html">F/T & P/T Sales Associate - Caliente Fashions</a> - <font size="-1"> (North Vancouver)</font></p> <p><a href="http://vancouver.en.craigslist.ca/van/ret/1817804151.html">IMMEDIATE EMPLOYMENT WANTED!</a> - </p> <p><a href="http://vancouver.en.craigslist.ca/nvn/ret/1817796152.html">TRAVEL AGENT</a> - <font size="-1"> (NORTH VANCOUVER)</font></p> <p><a href="http://vancouver
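
One possible approach to the question above, sketched with only the standard library's `html.parser`. The class `AnchorCollector` and the function `extract_links` are names invented for this illustration; the sample input reuses two of the paragraphs quoted in the question. Nested `<a>` elements are not handled.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collects an (href, text) pair for every <a> element it sees."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = None  # list of text chunks while inside <a>, else None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears between <a> and </a>.
        if self._text is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._text is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._text = None


def extract_links(html):
    parser = AnchorCollector()
    parser.feed(html)
    parser.close()
    return parser.links


sample = (
    '<p><a href="http://vancouver.en.craigslist.ca/van/ret/1817804151.html">'
    "IMMEDIATE EMPLOYMENT WANTED!</a> - </p> "
    '<p><a href="http://vancouver.en.craigslist.ca/nvn/ret/1817796152.html">'
    'TRAVEL AGENT</a> - <font size="-1"> (NORTH VANCOUVER)</font></p>'
)

for href, text in extract_links(sample):
    print(href, "->", text)
```

Text outside the anchors, such as the location in the `<font>` element, is deliberately dropped because the collector only records data while an `<a>` is open.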

Parsing EDGAR filings

人走茶凉 submitted on 2019-12-18 12:38:16
Question: I would like to use python2.7 to remove anything that isn't the documents' text from EDGAR filings (which are available online as .txt files). An example of what the files look like is here: Example. EDGAR provides its Document Type Definitions starting on page 48 of this file: DTD. The first part of my program gets the .txt file from the EDGAR online database into a local file that I've named "parseme.txt". What I would like to know is how to use the DTD to parse the .txt file. I would use a

Definition of HTML whitespace rules?

浪尽此生 submitted on 2019-12-17 14:00:34
Question: I'm looking for this definition to make my HTML renderer conform a bit better. Currently it's guessing which whitespace to keep, which to collapse, and which to throw away. The SGML standard is hard to find, and the HTML standard doesn't seem to treat the subject with the depth I need. Currently my renderer parses the HTML into a tree and then does a recursive layout pass to position all the elements and their content. I'm experimenting with throwing some whitespace out in the parse

Parse SGML with Open Arbitrary Tags in Python 3

你离开我真会死。 submitted on 2019-12-17 10:59:17
Question: I am trying to parse a file such as: http://www.sec.gov/Archives/edgar/data/1409896/000118143112051484/0001181431-12-051484.hdr.sgml I am using Python 3 and have been unable to find a solution with existing libraries to parse an SGML file with open tags. SGML allows implicitly closed tags. When attempting to parse the example file with LXML, XML, or Beautiful Soup, I end up with implicitly closed tags being closed at the end of the file instead of at the end of the line. For example: <COMPANY
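
The end-of-line closing behaviour the question asks for could be approximated with a small line-oriented parser. This is a minimal sketch under two assumptions that the question does not confirm: a tag followed by text on the same line is a leaf that closes at the end of that line, and a tag alone on its line opens a container that stays open until its explicit closing tag. The function `parse_line_sgml` and the sample text are invented for illustration and are not taken from the linked file.

```python
import re

# One tag per line: optional "/", a tag name, then the rest of the line.
TAG_LINE = re.compile(r"^<(/?)([A-Za-z0-9_-]+)>(.*)$")


def parse_line_sgml(text):
    """Parse line-oriented SGML into nested dicts (tag, value, children)."""
    root = {"tag": "ROOT", "value": None, "children": []}
    stack = [root]
    for raw in text.splitlines():
        match = TAG_LINE.match(raw.strip())
        if not match:
            continue  # ignore lines that do not start with a tag
        closing, tag, value = match.groups()
        if closing:
            # Pop back to the matching container, if one is open.
            if tag in [node["tag"] for node in stack[1:]]:
                while stack.pop()["tag"] != tag:
                    pass
            continue
        value = value.strip()
        node = {"tag": tag, "value": value or None, "children": []}
        stack[-1]["children"].append(node)
        if not value:
            stack.append(node)  # tag alone on a line: treat as a container
    return root


# Hypothetical sample, invented for this sketch.
sample_sgml = """<SUBMISSION>
<ACCESSION-NUMBER>0000000000-00-000000
<FILER>
<COMPANY-DATA>
<CONFORMED-NAME>EXAMPLE CORP
</COMPANY-DATA>
</FILER>
</SUBMISSION>"""

tree = parse_line_sgml(sample_sgml)
submission = tree["children"][0]
print(submission["tag"], [child["tag"] for child in submission["children"]])
```

A DTD-aware SGML parser would decide which elements may omit their end tags from the declarations themselves; the line-based heuristic here only works when the file really is laid out one tag per line.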

Querying Non-XML compliant structured data

强颜欢笑 submitted on 2019-12-13 03:23:35
Question: As a data analyst, I constantly run across files with structured data that are in some proprietary format and resist normal XML parsing. For example, I have an archive of about a hundred documents that all begin with this: <!DOCTYPE DOCUMENT PUBLIC "-//Gale Research//DTD Document V2.0//EN"> I have included an abridged example of the document below; don't read it if you're offended by cloning. At any rate, is there a way to query this without having DTD or namespace or URI or whatever

Comments inside HTML/SGML/XML/DTD declarations

笑着哭i submitted on 2019-12-12 03:22:45
Question: In the W3C HTML 4.01 DTDs and earlier, inline comments are frequently used within declarations. For example, the HTML 2.0 Strict DTD has: <!ENTITY % HTML.Version "-//IETF//DTD HTML 2.0 Strict//EN" -- Typical usage: <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML Strict//EN"> <html> ... </html> -- > where the HTML entity declaration contains a comment between two double hyphens -- . However, DTD validators seem to flat out reject these sorts of internal comments and throw an error. Are the validators