Question
I have to read some fairly large XML files (between 200 MB and 1 GB), some of which are invalid. Here is a small example:
<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:g="http://base.google.com/ns/1.0" version="2.0">
<item>
<title>Some article</title>
<g:material><ul><li>50 % Coton</li><li>50% Lyocell</li></g:material>
</item>
</rss>
Obviously, the closing </ul> tag is missing inside the g:material tag. Moreover, the people who developed this feed should have enclosed the g:material content in a CDATA section, which they did not. Basically, that is what I want to do: add this missing CDATA section.
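A minimal sketch of that repair, streamed line by line so memory use stays flat. It assumes that each g:material element opens and closes on a single line and carries no attributes, as in the sample above; the function names wrapInCdata and repairFeed are hypothetical.

```php
<?php
// Hypothetical helper: wraps the raw content of one element in a CDATA section.
// Assumption: the opening and closing tag sit on the same line and the opening
// tag has no attributes. Anything else is passed through unchanged.
function wrapInCdata(string $line, string $tag = 'g:material'): string
{
    $quoted  = preg_quote($tag, '~');
    $pattern = '~(<' . $quoted . '>)(.*?)(</' . $quoted . '>)~s';

    $result = preg_replace_callback($pattern, function (array $m): string {
        // Leave content alone if it is already a CDATA section.
        if (strncmp(ltrim($m[2]), '<![CDATA[', 9) === 0) {
            return $m[0];
        }
        // "]]>" would end the section early, so split it across two sections.
        $safe = str_replace(']]>', ']]]]><![CDATA[>', $m[2]);
        return $m[1] . '<![CDATA[' . $safe . ']]>' . $m[3];
    }, $line);

    // preg_replace_callback returns null on a regex error: keep the line as is.
    return $result ?? $line;
}

// Copies $in to $out one line at a time, repairing each line on the way.
function repairFeed(string $in, string $out): void
{
    $src = fopen($in, 'rb');
    $dst = fopen($out, 'wb');
    if ($src === false || $dst === false) {
        throw new RuntimeException('Cannot open input or output file');
    }
    while (($line = fgets($src)) !== false) {
        fwrite($dst, wrapInCdata($line));
    }
    fclose($src);
    fclose($dst);
}
```

Calling repairFeed('my.xml', 'my.clean.xml') would write a copy in which the unbalanced markup is plain character data, so a streaming parser no longer trips over the missing </ul>.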
I tried using a SAX parser to read this file, but it fails on the </g:material> tag because the </ul> tag is missing. I tried XMLReader and ran into basically the same issue.
I could probably do something with DOMDocument::loadHTML, but the size of these files is not really compatible with a DOM approach.
Do you have any idea how I could simply repair this feed without having to buy lots of RAM for DOMDocument to work?
Thanks.
Answer 1:
If the files are too large to use the Tidy extension, you can use the tidy CLI tool to make the files parseable.
$ tidy -output my.clean.xml my.xml
After that, the XML files are well-formed, so you can parse them using the XMLReader. Since tidy adds the 'missing' (X)HTML parts, your original document's code is inside the element.
Answer 2:
(copied from https://stackoverflow.com/a/17903058/287948)
Summarizing as two steps:
- Use Tidy to transform "free HTML" into "good XHTML".
- Use the XML Parser to parse the XHTML as XML through a SAX-style API.
First use Tidy (!) to transform "free HTML" into XHTML (also when you cannot trust your "supposed XHTML"). See the cleanRepair method. It takes more time, but it works with big files (!). If the file is very big, set the maximum execution time to a few minutes.
Another option (for working with big files) is to cache your XHTML files once they have been checked or transformed into XHTML. See Tidy's repairFile method.
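The repair-then-cache idea could look roughly like the sketch below. It assumes the Tidy extension is loaded; the helper name repairToCache, the five-minute limit and the chosen Tidy options are illustrative assumptions, not something the answer prescribes.

```php
<?php
// Hypothetical helper: repairs $source with the Tidy extension and stores the
// XHTML result in $cache, so the slow repair runs only once per file.
function repairToCache(string $source, string $cache): string
{
    if (!extension_loaded('tidy')) {
        throw new RuntimeException('The tidy extension is not loaded');
    }
    // Reuse the cached copy while it is at least as recent as the source.
    if (is_file($cache) && filemtime($cache) >= filemtime($source)) {
        return $cache;
    }

    set_time_limit(300); // assumption: five minutes is enough for one file

    $config = [
        'output-xhtml'     => true, // emit well-formed XHTML
        'numeric-entities' => true, // numeric entities, which XML parsers accept without a DTD
        'wrap'             => 0,    // do not re-wrap long lines
    ];
    $xhtml = tidy_repair_file($source, $config, 'utf8');
    if ($xhtml === false) {
        throw new RuntimeException("Tidy could not repair $source");
    }
    file_put_contents($cache, $xhtml);

    return $cache;
}
```

Note that tidy_repair_file returns the whole repaired document as one string, so this variant still needs enough memory to hold the output; the tidy CLI is the fallback when that is too much.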
With a "trusted XHTML", use SAX. How do you use SAX with PHP?
Parse the XML with a standard SAX API, which in PHP is implemented by LibXML (see LibXML2 at xmlsoft.org); its interface is PHP's XML Parser, which is close to the standard SAX API.
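As an illustration of that event-driven style, here is a small sketch that feeds a stream to PHP's XML Parser in 8 KB chunks and collects the text of one element. The function name collectText and the chunk size are assumptions made for the example.

```php
<?php
// Streams an XML source through PHP's XML Parser extension in small chunks
// and collects the text of every element named $wanted.
function collectText($stream, string $wanted): array
{
    $parser = xml_parser_create('UTF-8');
    // Keep element names as written instead of upper-casing them.
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

    $found  = [];
    $inside = false; // true while the parser is inside a $wanted element
    $buffer = '';

    xml_set_element_handler(
        $parser,
        function ($p, string $name, array $attrs) use ($wanted, &$inside, &$buffer) {
            if ($name === $wanted) {
                $inside = true;
                $buffer = '';
            }
        },
        function ($p, string $name) use ($wanted, &$inside, &$buffer, &$found) {
            if ($name === $wanted) {
                $found[] = $buffer;
                $inside  = false;
            }
        }
    );
    // Text may arrive in several pieces, so append rather than assign.
    xml_set_character_data_handler(
        $parser,
        function ($p, string $data) use (&$inside, &$buffer) {
            if ($inside) {
                $buffer .= $data;
            }
        }
    );

    while (!feof($stream)) {
        $chunk = fread($stream, 8192);
        if (!xml_parse($parser, $chunk, feof($stream))) {
            throw new RuntimeException(sprintf(
                'XML error "%s" at line %d',
                xml_error_string(xml_get_error_code($parser)),
                xml_get_current_line_number($parser)
            ));
        }
    }

    return $found;
}
```

A call such as collectText(fopen('my.clean.xml', 'rb'), 'title') only ever holds one chunk and one element's text in memory. On a file that is not well-formed it throws at the first error, which is why the Tidy step has to come first.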
Another way to use the "SAX of LibXML2" through a different interface (a PHP iterator instead of the traditional SAX interface) is XMLReader. See this explanation about "XMLReader use SAX".
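For comparison, a minimal XMLReader sketch of the same task. Here the calling code pulls nodes one at a time instead of registering callbacks; the function name readElements is hypothetical, and the file is assumed to be well-formed already.

```php
<?php
// Walks the document node by node with XMLReader and returns the text
// content of each element whose qualified name equals $wanted.
function readElements(string $path, string $wanted): array
{
    $reader = new XMLReader();
    if (!$reader->open($path)) {
        throw new RuntimeException("Cannot open $path");
    }

    $values = [];
    while ($reader->read()) {
        if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === $wanted) {
            // readString() returns the text of the current element.
            $values[] = $reader->readString();
        }
    }
    $reader->close();

    return $values;
}
```

Prefixed names are matched as written, so readElements('my.clean.xml', 'g:material') would target the feed's material elements.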
Yes, the terms "SAX" and "SAX API" do not appear in the PHP manual (!). See this old but good introduction.
Source: https://stackoverflow.com/questions/15679103/php-read-and-repair-big-invalid-xml-files