How do I keep track of parsing progress of large files in StAX?

Submitted by ↘锁芯ラ on 2019-12-22 17:58:08

Question


I'm processing large (1TB) XML files using the StAX API. Let's assume we have a loop handling some elements:

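XMLInputFactory fac = XMLInputFactory.newInstance();
XMLStreamReader reader = fac.createXMLStreamReader(new FileReader(inputFile));
// hasNext()/next() stops cleanly at the end of the document;
// nextTag() throws on text content and at the end of the document
while (reader.hasNext()) {
    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
        // handle contents
    }
}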

How do I keep track of overall progress within the large XML file? Fetching the offset from reader works fine for smaller files:

int offset = reader.getLocation().getCharacterOffset();

but since the offset is an int, it will probably only work for files up to about 2 GB...


Answer 1:


A simple FilterReader should work.

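import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;

class ProgressCounter extends FilterReader {
    // Number of characters consumed so far; a long, so it does not overflow at 2^31 - 1
    long progress = 0;

    @Override
    public long skip(long n) throws IOException {
        // Count what was actually skipped, which may be fewer than n characters
        long skipped = super.skip(n);
        progress += skipped;
        return skipped;
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        int red = super.read(cbuf, off, len);
        // read returns -1 at end of stream; that must not be added to the count
        if (red > 0) {
            progress += red;
        }
        return red;
    }

    @Override
    public int read() throws IOException {
        int red = super.read();
        // The return value is the character itself (or -1), so count one character
        if (red != -1) {
            progress++;
        }
        return red;
    }

    public ProgressCounter(Reader in) {
        super(in);
    }

    public long getProgress() {
        return progress;
    }
}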



Answer 2:


It seems that the StAX API cannot give you a long offset.

As a workaround you could create a custom java.io.FilterReader class which overrides read() and read(char[] cbuf, int off, int len) to increment a long offset.

You would pass this reader to the XMLInputFactory. The handler loop can then get the offset information directly from the reader.
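
A minimal sketch of that wiring, assuming a hypothetical CountingReader helper (not part of StAX or the JDK) and a small in-memory document in place of the real file. Note that the parser reads ahead in blocks, so the count is an upper bound on the current parse position rather than the exact offset of the current element.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: counts the characters handed to the parser.
class CountingReader extends FilterReader {
    private long count = 0;

    CountingReader(Reader in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int c = super.read();
        if (c != -1) {
            count++; // one character consumed
        }
        return c;
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        int n = super.read(cbuf, off, len);
        if (n > 0) {
            count += n; // ignore the -1 returned at end of stream
        }
        return n;
    }

    long getCount() {
        return count;
    }
}

class ReaderProgressDemo {
    // Parses the document and returns the number of characters consumed.
    static long parse(String xml) throws XMLStreamException {
        CountingReader counting = new CountingReader(new StringReader(xml));
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(counting);
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                // The handler loop asks the wrapper, not the XMLStreamReader.
                System.out.println(reader.getLocalName()
                        + ": " + counting.getCount() + " chars read so far");
            }
        }
        reader.close();
        return counting.getCount();
    }

    public static void main(String[] args) throws XMLStreamException {
        parse("<root><item>a</item><item>b</item></root>");
    }
}
```

For a real file, the StringReader would be replaced by a Reader opened on the file; the counting logic stays the same.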

You could also do this at the byte level using a FilterInputStream, counting the byte offset instead of the character offset. That would allow an exact progress calculation given the file size.
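
The byte-level variant could look roughly like this. CountingInputStream and ByteProgressDemo are hypothetical names, and the demo uses an in-memory byte array as a stand-in for the file so that it runs on its own. The percentage is exact with respect to bytes pulled from the stream, which runs slightly ahead of the parser's position because of buffering.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: counts the bytes pulled from the underlying stream.
class CountingInputStream extends FilterInputStream {
    private long count = 0;

    CountingInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b != -1) {
            count++;
        }
        return b;
    }

    // read(byte[]) delegates to this method, so it is not overridden separately.
    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) {
            count += n;
        }
        return n;
    }

    @Override
    public long skip(long n) throws IOException {
        long skipped = super.skip(n);
        if (skipped > 0) {
            count += skipped;
        }
        return skipped;
    }

    // Report no mark support so a reset() cannot make bytes count twice.
    @Override
    public boolean markSupported() {
        return false;
    }

    long getCount() {
        return count;
    }
}

class ByteProgressDemo {
    static double percent(long bytesRead, long totalBytes) {
        return totalBytes == 0 ? 100.0 : 100.0 * bytesRead / totalBytes;
    }

    // Parses the document and returns the number of bytes consumed.
    static long parse(byte[] xml) throws XMLStreamException {
        CountingInputStream counting =
                new CountingInputStream(new ByteArrayInputStream(xml));
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(counting);
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                System.out.printf("%s: %.1f%% read%n", reader.getLocalName(),
                        percent(counting.getCount(), xml.length));
            }
        }
        reader.close();
        return counting.getCount();
    }

    public static void main(String[] args) throws XMLStreamException {
        parse("<root><item>a</item></root>".getBytes(StandardCharsets.UTF_8));
    }
}
```

With a file on disk, the total would come from something like java.nio.file.Files.size(path) and the stream from Files.newInputStream(path), wrapped in the counter before being handed to the XMLInputFactory.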



Source: https://stackoverflow.com/questions/34724494/how-do-i-keep-track-of-parsing-progress-of-large-files-in-stax
