Which XML parser to use here?

只谈情不闲聊 提交于 2019-12-05 05:32:53

Fastest solution is by far a StAX parser, specially as you only need a specific subset of the XML file and you can easily ignore whatever isn't really necessary using StAX, while you would receive the event anyway if you were using a SAX parser.

But it's also a little bit more complicated than using SAX or DOM. One of these days I had to write a StAX parser for the following XML:

<?xml version="1.0"?>
<table>
    <row>
        <column>1</column>
        <column>Nome</column>
        <column>Sobrenome</column>
        <column>email@gmail.com</column>
        <column></column>
        <column>2011-06-22 03:02:14.915</column>
        <column>2011-06-22 03:02:25.953</column>
        <column></column>
        <column></column>
    </row>
</table>    

Here's how the final parser code looks like:

public class Parser {

private String[] files ;

public Parser(String ... files) {
    this.files = files;
}

private List<Inscrito> process() {

    List<Inscrito> inscritos = new ArrayList<Inscrito>();


    for ( String file : files ) {

        XMLInputFactory factory = XMLInputFactory.newFactory();

        try {

            String content = StringEscapeUtils.unescapeXml( FileUtils.readFileToString( new File(file) ) );

            XMLStreamReader parser = factory.createXMLStreamReader( new ByteArrayInputStream( content.getBytes() ) );

            String currentTag = null;
            int columnCount = 0;
            Inscrito inscrito = null;           

            while ( parser.hasNext() ) {

                int currentEvent = parser.next();

                switch ( currentEvent ) {
                case XMLStreamReader.START_ELEMENT: 

                    currentTag = parser.getLocalName();

                    if ( "row".equals( currentTag ) ) {
                        columnCount = 0;
                        inscrito = new Inscrito();                      
                    }

                    break;
                case XMLStreamReader.END_ELEMENT:

                    currentTag = parser.getLocalName();

                    if ( "row".equals( currentTag ) ) {
                        inscritos.add( inscrito );
                    }

                    if ( "column".equals( currentTag ) ) {
                        columnCount++;
                    }                   

                    break;
                case XMLStreamReader.CHARACTERS:

                    if ( "column".equals( currentTag ) ) {

                        String text = parser.getText().trim().replaceAll( "\n" , " "); 

                        switch( columnCount ) {
                        case 0:
                            inscrito.setId( Integer.valueOf( text ) );
                            break;
                        case 1:                         
                            inscrito.setFirstName( WordUtils.capitalizeFully( text ) );
                            break;
                        case 2:
                            inscrito.setLastName( WordUtils.capitalizeFully( text ) );
                            break;
                        case 3:
                            inscrito.setEmail( text );
                            break;
                        }

                    }

                    break;
                }

            }

            parser.close();

        } catch (Exception e) {
            throw new IllegalStateException(e);
        }           

    }

    Collections.sort(inscritos);

    return inscritos;

}

public Map<String,List<Inscrito>> parse() {

    List<Inscrito> inscritos = this.process();

    Map<String,List<Inscrito>> resultado = new LinkedHashMap<String, List<Inscrito>>();

    for ( Inscrito i : inscritos ) {

        List<Inscrito> lista = resultado.get( i.getInicial() );

        if ( lista == null ) {
            lista = new ArrayList<Inscrito>();
            resultado.put( i.getInicial(), lista );
        }

        lista.add( i );

    }

    return resultado;
}

}

The code itself is in portuguese but it should be straightforward for you to understand what it is, here's the repo on github.

If you're only extracting a small amount, consider looking into using XPath as this is somewhat simpler than trying to extract the whole document.

Note: I'm the EclipseLink JAXB (MOXy) lead, and a member of the JAXB 2 (JSR-222) expert group.

StAX (JSR-173) is generally the fastest way to parse XML, and Woodstox is know for being a fast StAX parser. In addition to parsing, you need to collect the XML data. This is where a combination of StAX and JAXB comes in handy.

To ensure that our JAXB implementation uses the Woodstox StAX implementation. Configure your environment to use Woodstox (this is as simple as adding Woodstox to your classpath). Create an instance of XMLStreamReader and pass that as the source that JAXB should unmarshal.

Either SAX or StAX could handle this with some complex work figuring out that you're at something you want, but for extracting a small set of things by explicit path, you might be best off with XPath.

Another potential tactic is to first filter to only the parts you want using XSLT and then parse with anything you like, as the result of the filter will be a much smaller document.

I think that you should use SAX or parser based on SAX. I'd recommend you apache Digester. SAX is event driven and does not store state. This is what you need here due to you have to extract only small part of the document (I guess one tag).

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!