How to read CDATA from an XML file in Node Sax

Submitted by 人走茶凉 on 2019-12-13 16:51:09

Question


I have an XML structure like this:

<?xml version="1.0" encoding="utf-8"?>
<videos>
    <video>
        <id>47288</id>
        <thumbs>
            <thumb><![CDATA[http://foo.com/bar.jpg]]></thumb>
        </thumbs>
        <link><![CDATA[http://foo.com/bar.html]]></link>
        <title><![CDATA[Sample Title Here]]></title>
        <categories>
            <category><![CDATA[Cat1]]></category>
            <category><![CDATA[Cat2]]></category>
        </categories>
        <tags>
            <tag><![CDATA[Tag1]]></tag>
            <tag><![CDATA[Tag2]]></tag>
            <tag><![CDATA[Tag3]]></tag>
            <tag><![CDATA[Tag4]]></tag>
            <tag><![CDATA[Tag5]]></tag>
            <tag><![CDATA[Tag6]]></tag>
        </tags>
        <duration><![CDATA[9:57]]></duration>
        <pubDate><![CDATA[2013-12-17]]></pubDate>
    </video>
    // insert 200,000 more <video> entries here

I have no idea why everything is wrapped in CDATA, but there's not much I can do about it; it's the data I've been given. To read this massive (1.5 GB) XML file, I stream it with fs into sax and then into saxpath, like so:

var fs = require('fs');
var util = require('util');
var sax = require('sax');
var saxpath = require('saxpath');
var parseString = require('xml2js').parseString;

// Strict-mode sax stream; SaXPath emits one 'match' per /videos/video node.
var saxParser = sax.createStream(true);
var streamer = new saxpath.SaXPath(saxParser, '/videos/video');

streamer.on('match', function (xml) {
    // xml is the serialized <video> element as a string.
    console.log(xml);
    parseString(xml, function (err, result) {
        if (err) {
            console.error(err);
            return;
        }
        // Print the whole nested object (a depth of null means no limit).
        console.log(util.inspect(result, false, null));
    });
});

fs.createReadStream('./xml/big_data_file.xml').pipe(saxParser);

However, when I get to the console.log(xml), it shows this:

<video>
    <id>620339</id>
    <thumbs>
        <thumb></thumb>
    </thumbs>
    <link></link>
    <title></title>
    <categories>
        <category></category>
        <category></category>
    </categories>
    <tags>
        <tag></tag>
        <tag></tag>
        <tag></tag>
        <tag></tag>
        <tag></tag>
        <tag></tag>
        <tag></tag>
    </tags>
    <duration></duration>
    <pubDate></pubDate>
</video>

There is no data inside the elements whatsoever. The SaXPath docs don't mention CDATA, and I'm not sure whether this is an issue with SaXPath or with sax itself.

Any ideas how I can remedy this?

Cheers!


Answer 1:


That's a limitation of SaXPath 0.5.4. Version 0.5.5, which was just pushed to npm, now handles CDATA as you would expect (see commit).
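
If you are unsure which SaXPath version a project actually resolves, a small check like the one below may help. It is only a sketch: `isAtLeast` is a hypothetical helper, not part of SaXPath, and it assumes plain numeric versions such as 0.5.4 with no pre-release suffix.

```javascript
// Compare two dotted version strings such as '0.5.4' and '0.5.5'.
// Returns true when `installed` is the same as or newer than `minimum`.
// Only plain numeric versions are handled; pre-release tags are not.
function isAtLeast(installed, minimum) {
    const a = installed.split('.').map(Number);
    const b = minimum.split('.').map(Number);
    for (let i = 0; i < Math.max(a.length, b.length); i++) {
        const x = a[i] || 0; // a missing segment counts as 0
        const y = b[i] || 0;
        if (x !== y) return x > y;
    }
    return true; // all segments are equal
}

// Hypothetical usage with the two versions mentioned above.
console.log(isAtLeast('0.5.4', '0.5.5')); // false: CDATA is dropped
console.log(isAtLeast('0.5.5', '0.5.5')); // true: CDATA is kept
```

In a real project the installed version string would probably come from something like require('saxpath/package.json').version, assuming the package lets that file be required.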

With the exact same code and the latest version of SaXPath:

<video>
        <id>47288</id>
        <thumbs>
            <thumb><![CDATA[http://foo.com/bar.jpg]]></thumb>
        </thumbs>
        <link><![CDATA[http://foo.com/bar.html]]></link>
        <title><![CDATA[Sample Title Here]]></title>
        <categories>
            <category><![CDATA[Cat1]]></category>
            <category><![CDATA[Cat2]]></category>
        </categories>
        <tags>
            <tag><![CDATA[Tag1]]></tag>
            <tag><![CDATA[Tag2]]></tag>
            <tag><![CDATA[Tag3]]></tag>
            <tag><![CDATA[Tag4]]></tag>
            <tag><![CDATA[Tag5]]></tag>
            <tag><![CDATA[Tag6]]></tag>
        </tags>
        <duration><![CDATA[9:57]]></duration>
        <pubDate><![CDATA[2013-12-17]]></pubDate>
</video>

And the parsed result of xml2js:

{ video: 
   { id: [ '47288' ],
     thumbs: [ { thumb: [ 'http://foo.com/bar.jpg' ] } ],
     link: [ 'http://foo.com/bar.html' ],
     title: [ 'Sample Title Here' ],
     categories: [ { category: [ 'Cat1', 'Cat2' ] } ],
     tags: [ { tag: [ 'Tag1', 'Tag2', 'Tag3', 'Tag4', 'Tag5', 'Tag6' ] } ],
     duration: [ '9:57' ],
     pubDate: [ '2013-12-17' ] } }
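
As the output shows, xml2js wraps every child value in an array by default. If that gets in the way, a rough sketch like the following collapses single-element arrays after parsing. `unwrap` is a hypothetical helper, not part of xml2js, and the sample object is a shortened copy of the result above.

```javascript
// Recursively replace single-element arrays with their only item.
// Arrays with several items (such as the tags) stay arrays.
function unwrap(value) {
    if (Array.isArray(value)) {
        const items = value.map(unwrap);
        return items.length === 1 ? items[0] : items;
    }
    if (value !== null && typeof value === 'object') {
        const out = {};
        for (const key of Object.keys(value)) {
            out[key] = unwrap(value[key]);
        }
        return out;
    }
    return value; // strings and other primitives pass through unchanged
}

// Shortened copy of the parsed result shown above.
const parsed = {
    video: {
        id: ['47288'],
        thumbs: [{ thumb: ['http://foo.com/bar.jpg'] }],
        tags: [{ tag: ['Tag1', 'Tag2'] }]
    }
};

const video = unwrap(parsed).video;
console.log(video.id);           // '47288'
console.log(video.thumbs.thumb); // 'http://foo.com/bar.jpg'
console.log(video.tags.tag);     // [ 'Tag1', 'Tag2' ]
```

One caveat with this approach: a video with exactly one tag would end up with a string rather than a one-item array, so consumers need to handle both shapes. xml2js also documents an explicitArray option that may be a better fit if you want this behavior at parse time.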


Source: https://stackoverflow.com/questions/20673911/how-to-read-cdata-from-an-xml-file-in-node-sax
