PHP - Read and repair big invalid XML files

Submitted by 南楼画角 on 2019-12-12 10:43:51

Question


I have to read some fairly large XML files (between 200 MB and 1 GB), some of which are not well-formed. Here is a small example:

<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:g="http://base.google.com/ns/1.0" version="2.0">
  <item>
    <title>Some article</title>
    <g:material><ul><li>50 % Coton</li><li>50% Lyocell</li></g:material>
  </item>
</rss>

Obviously, the </ul> closing tag is missing inside the g:material element. Moreover, the people who developed this feed should have enclosed the g:material content in a CDATA section, which they did not. Basically, that is what I want to do: add the missing CDATA section.

I tried a SAX parser to read this file, but it fails on the </g:material> tag because the </ul> tag is missing. I tried XMLReader and hit essentially the same issue. I could probably do something with DOMDocument::loadHTML, but the size of this file is not really compatible with a DOM approach. Do you have any idea how I could repair this feed without having to buy lots of RAM for DOMDocument to work? Thanks.


Answer 1:


If the files are too large for the Tidy extension, you can use the tidy CLI tool to make them parseable.

$ tidy -output my.clean.xml my.xml

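After that, the files are well-formed, so you can parse them with XMLReader. Since tidy adds the 'missing' (X)HTML parts, your original document's markup ends up inside the wrapper element that tidy generates (presumably <body>; the element name is missing from this copy of the answer).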
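
A minimal sketch of streaming a well-formed file with XMLReader, one node at a time, so memory use stays flat regardless of file size. The function name `iterateTitles` and the choice of the `title` element are assumptions for illustration; adjust them to wherever your data ends up in tidy's output.

```php
<?php
// Sketch only: iterateTitles() and the 'title' element are illustrative assumptions.

function iterateTitles(string $path): Generator
{
    $reader = new XMLReader();
    if (!$reader->open($path)) {
        throw new RuntimeException("Cannot open $path");
    }

    try {
        // read() advances to the next node; only the current node is held in memory.
        while ($reader->read()) {
            if ($reader->nodeType === XMLReader::ELEMENT && $reader->localName === 'title') {
                // readString() returns the text content of the current element.
                yield $reader->readString();
            }
        }
    } finally {
        $reader->close();
    }
}
```

Used as `foreach (iterateTitles('my.clean.xml') as $title) { ... }`, this yields each title as soon as it is read instead of collecting them all first.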




Answer 2:


(copied from https://stackoverflow.com/a/17903058/287948)

Summarized in two steps:

  1. Use Tidy to transform "free HTML" into "good XHTML".
  2. Use the XML Parser to parse the XHTML as XML through its SAX-style API.

First use Tidy (!) to transform "free HTML" into XHTML (or whenever you cannot trust your "supposed XHTML"). See the cleanRepair method. It takes more time, but it works with big files (!). If the file is very large, raise the maximum execution time to a few minutes.

Another option (for working with big files) is to cache your XHTML files once they have been checked or transformed into XHTML. See Tidy's repairFile method.
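
The caching idea could be sketched roughly as follows, using the procedural form `tidy_repair_file()` from the Tidy extension. The helper name `repairedCopy`, the cache directory and the `.clean.xhtml` suffix are hypothetical. Note that this call returns the whole repaired document as a string, so it is still held in memory once; the cache only avoids repeating that work.

```php
<?php
// Sketch only: requires ext-tidy. repairedCopy() and the cache layout are hypothetical.

function repairedCopy(string $source, string $cacheDir): string
{
    $target = $cacheDir . '/' . basename($source) . '.clean.xhtml';

    // Reuse the cached copy if it is at least as new as the source file.
    if (is_file($target) && filemtime($target) >= filemtime($source)) {
        return $target;
    }

    // 'output-xhtml' asks Tidy to emit XHTML; 'utf8' is the input/output encoding.
    $clean = tidy_repair_file($source, ['output-xhtml' => true], 'utf8');
    if ($clean === false) {
        throw new RuntimeException("Tidy could not repair $source");
    }
    file_put_contents($target, $clean);

    return $target;
}
```

The returned path can then be handed to a streaming parser, and later runs skip the Tidy step as long as the source file has not changed.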

With "trusted XHTML", use SAX. How do you use SAX with PHP?

Parse the XML with a standard SAX-style API. In PHP this is implemented by LibXML (see LibXML2 at xmlsoft.org), and its interface is PHP's XML Parser extension, which is close to the standard SAX API.
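
As a rough illustration of that event-based interface, the sketch below feeds a file to PHP's XML Parser in 8 KB chunks and counts start-tag events for one element name. The function name `countElements` and the chunk size are assumptions; on input that is not well-formed, such as the feed in the question, `xml_parse()` reports an error and the sketch throws.

```php
<?php
// Sketch only: countElements() and the 8192-byte chunk size are illustrative assumptions.

function countElements(string $path, string $name): int
{
    $count = 0;
    $parser = xml_parser_create();
    // Keep element names as written instead of upper-casing them.
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

    xml_set_element_handler(
        $parser,
        // Called for every start tag.
        function ($parser, string $element, array $attributes) use (&$count, $name): void {
            if ($element === $name) {
                $count++;
            }
        },
        // Called for every end tag; nothing to do when only counting.
        function ($parser, string $element): void {
        }
    );

    $handle = fopen($path, 'rb');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }

    try {
        while (!feof($handle)) {
            $chunk = (string) fread($handle, 8192);
            // The third argument tells the parser whether this is the final chunk.
            if (!xml_parse($parser, $chunk, feof($handle))) {
                throw new RuntimeException(sprintf(
                    'XML error at line %d: %s',
                    xml_get_current_line_number($parser),
                    xml_error_string(xml_get_error_code($parser))
                ));
            }
        }
    } finally {
        fclose($handle);
    }

    return $count;
}
```

Because only one chunk is in memory at a time, this style suits files in the 200 MB to 1 GB range, provided they are well-formed by the time they reach the parser.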

Another way to use the "SAX of LibXML2", through a different interface (a PHP iterator instead of the traditional SAX interface), is XMLReader. See this explanation about "XMLReader use SAX".


Yes, the terms "SAX" and "SAX API" do not appear in the PHP manual (!). See this old but good introduction.



Source: https://stackoverflow.com/questions/15679103/php-read-and-repair-big-invalid-xml-files
