Using StAX to create index for XML for quick access

孤街浪徒 2020-12-01 15:26

Is there a way to use StAX and JAX-B to create an index and then get quick access to an XML file?

I have a large XML file and I need to find information in it. This

2 Answers
  • 2020-12-01 15:56

    I just had to solve this problem, and spent way too much time figuring it out. Hopefully the next poor soul who comes looking for ideas can benefit from my suffering.

    The first problem to contend with is that most XMLStreamReader implementations provide inaccurate results when you ask them for their current offsets. Woodstox however seems to be rock-solid in this regard.
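
    For anyone who has not used the location API before, here is a rough sketch of asking a StAX reader for offsets. It uses only the standard javax.xml.stream interfaces, so it runs against whichever implementation XMLInputFactory finds; with Woodstox on the classpath that should be Woodstox, otherwise the JDK default, whose numbers may be exactly the unreliable kind described above. The class and method names are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of any library.
class StaxOffsetProbe {

    /** Collects the character offset the parser reports at each start tag named elementName. */
    static List<Integer> reportedOffsets(String xml, String elementName) {
        List<Integer> offsets = new ArrayList<>();
        try {
            // newFactory() uses whichever StAX implementation is available at runtime.
            XMLStreamReader reader =
                    XMLInputFactory.newFactory().createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && elementName.equals(reader.getLocalName())) {
                    // What this value points at (start or end of the tag) and how
                    // accurate it is depends on the implementation.
                    offsets.add(reader.getLocation().getCharacterOffset());
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
        return offsets;
    }

    public static void main(String[] args) {
        String xml = "<feed><item>a</item><item>b</item></feed>";
        System.out.println(reportedOffsets(xml, "item"));
    }
}
```

    Comparing the printed numbers with the real positions in a known file is a quick way to check whether a given implementation can be trusted.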

    The second problem is the type of offset you use. If you need to work with a multi-byte charset you have to use char offsets, which makes random-access retrieval from the file inefficient: you can't just set a pointer into the file at your offset and start reading. You have to read through the file until you reach the offset (that's what skip does under the covers in a Reader) and only then start extracting. With very large files, retrieving content near the end of the file is too slow.
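
    A tiny illustration of why the two kinds of offset drift apart, assuming the file is UTF-8. The helper name is made up, and it assumes the offset does not fall between the two halves of a surrogate pair (element boundaries never do).

```java
import java.nio.charset.StandardCharsets;

class OffsetMismatchDemo {

    /** Number of UTF-8 bytes that precede the character at charOffset. */
    static int byteOffsetOf(String text, int charOffset) {
        return text.substring(0, charOffset).getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // Two accented letters, each one char but two bytes in UTF-8.
        String xml = "<a>\u00e9\u00e9</a><b/>";
        int charOffset = xml.indexOf("<b/>");
        System.out.println("char offset: " + charOffset);                    // char offset: 9
        System.out.println("byte offset: " + byteOffsetOf(xml, charOffset)); // byte offset: 11
    }
}
```

    Every multi-byte character before an element pushes its byte offset further from its char offset, so seeking to the char offset in the raw file lands in the wrong place.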

    I ended up writing a FilterReader that keeps a buffer of byte offset to char offset mappings as the file is read. When we need to get the byte offset, we first ask Woodstox for the char offset, then get the custom reader to tell us the actual byte offset for the char offset. We can get the byte offset from the beginning and end of the element, giving us what we need to go in and surgically extract the element from the file by opening it as a RandomAccessFile, which means it's super fast at any point in the file.
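
    To make the idea concrete, here is a minimal sketch of the approach. It is not the code from the library: Utf8OffsetReader and ExtractDemo are hypothetical names, it assumes well-formed UTF-8 input, and a plain indexOf stands in for the char offsets Woodstox would report.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical sketch: counts the UTF-8 bytes behind every char that passes through. */
class Utf8OffsetReader extends FilterReader {
    private final long[] window;  // window[i % length] = byte offset of char i
    private long chars = 0;       // chars handed out so far
    private long bytes = 0;       // UTF-8 bytes those chars occupy in the file

    Utf8OffsetReader(Reader in, int windowSize) {
        super(in);
        this.window = new long[windowSize];
    }

    @Override
    public int read() throws IOException {
        int c = super.read();
        if (c >= 0) track((char) c);
        return c;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        for (int i = 0; i < n; i++) track(buf[off + i]);
        return n;
    }

    @Override
    public long skip(long n) throws IOException {
        long done = 0;
        while (done < n && read() >= 0) done++;  // read through so skipped chars are tracked too
        return done;
    }

    private void track(char c) {
        window[(int) (chars++ % window.length)] = bytes;
        if (c < 0x80) bytes += 1;
        else if (c < 0x800) bytes += 2;
        else if (Character.isHighSurrogate(c)) bytes += 4;  // whole surrogate pair counted here
        else if (!Character.isLowSurrogate(c)) bytes += 3;
    }

    /** Byte offset of the char at charOffset; only the most recent window of chars is kept. */
    long byteOffset(long charOffset) {
        if (charOffset == chars) return bytes;
        if (charOffset < 0 || charOffset > chars || chars - charOffset > window.length) {
            throw new IllegalArgumentException("char offset outside tracked window: " + charOffset);
        }
        return window[(int) (charOffset % window.length)];
    }
}

class ExtractDemo {
    /** Streams the file once and maps a char range to the matching byte range. */
    static long[] toByteRange(Path file, long startChar, long endChar) throws IOException {
        try (Utf8OffsetReader reader = new Utf8OffsetReader(
                Files.newBufferedReader(file, StandardCharsets.UTF_8), 1024)) {
            while (reader.read() >= 0) { /* a StAX parser would be consuming the reader here */ }
            return new long[] { reader.byteOffset(startChar), reader.byteOffset(endChar) };
        }
    }

    /** Reads bytes [start, end) straight from the file without scanning what comes before. */
    static String extract(Path file, long start, long end) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            byte[] buf = new byte[(int) (end - start)];
            raf.seek(start);
            raf.readFully(buf);
            return new String(buf, StandardCharsets.UTF_8);
        }
    }

    static String demo() {
        String xml = "<feed><item>caf\u00e9 \u4e2d\u6587</item><item>second</item></feed>";
        String wanted = "<item>second</item>";
        try {
            Path file = Files.createTempFile("feed", ".xml");
            Files.write(file, xml.getBytes(StandardCharsets.UTF_8));
            int startChar = xml.indexOf(wanted);  // stand-in for parser-reported char offsets
            long[] range = toByteRange(file, startChar, startChar + wanted.length());
            String result = extract(file, range[0], range[1]);
            Files.delete(file);
            return result;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());  // <item>second</item>
    }
}
```

    In real use you would ask for the byte offsets while the parser is still near the element, which is why the sketch only remembers a bounded window of recent chars. Once the byte ranges are stored, each later lookup is a single seek plus a read of just that element.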

    I created a library for this, it's on GitHub and Maven Central. If you just want to get the important bits, the party trick is in the ByteTrackingReader.

    Some people have commented about how this whole thing is a bad idea and why would you want to do it? XML is a transport mechanism, you should just import it to a DB and work with the data with more appropriate tools. For most cases this is true, but if you're building applications or integrations that communicate via XML, you need tooling to analyze and operate on the files that are exchanged. I get daily requests to verify feed contents, having the ability to quickly extract a specific set of items from a massive file and verify not only the contents, but the format itself is essential.

    Anyhow, hopefully this can save someone a few hours, or at least get them closer to a solution.

  • 2020-12-01 16:08

    You could work with a generated XML parser using ANTLR4.

    The following works very well on a ~17GB Wikipedia dump (/20170501/dewiki-20170501-pages-articles-multistream.xml.bz2), but I had to increase the heap size using -Xmx6g.

    1. Get XML Grammar

    cd /tmp
    git clone https://github.com/antlr/grammars-v4
    

    2. Generate Parser

    cd /tmp/grammars-v4/xml/
    mvn clean install
    

    3. Copy Generated Java files to your Project

    cp -r target/generated-sources/antlr4 /path/to/your/project/gen
    

    4. Hook in with a Listener to collect character offsets

    package stack43366566;
    
    import java.util.ArrayList;
    import java.util.List;
    
    import org.antlr.v4.runtime.ANTLRFileStream;
    import org.antlr.v4.runtime.CommonTokenStream;
    import org.antlr.v4.runtime.tree.ParseTreeWalker;
    
    import stack43366566.gen.XMLLexer;
    import stack43366566.gen.XMLParser;
    import stack43366566.gen.XMLParser.DocumentContext;
    import stack43366566.gen.XMLParserBaseListener;
    
    public class FindXmlOffset {
    
        List<Integer> offsets = null;
        String searchForElement = null;
    
        public class MyXMLListener extends XMLParserBaseListener {
            public void enterElement(XMLParser.ElementContext ctx) {
                String name = ctx.Name().get(0).getText();
                if (searchForElement.equals(name)) {
                    offsets.add(ctx.start.getStartIndex());
                }
            }
        }
    
        public List<Integer> createOffsets(String file, String elementName) {
            searchForElement = elementName;
            offsets = new ArrayList<>();
            try {
                XMLLexer lexer = new XMLLexer(new ANTLRFileStream(file));
                CommonTokenStream tokens = new CommonTokenStream(lexer);
                XMLParser parser = new XMLParser(tokens);
                DocumentContext ctx = parser.document();
                ParseTreeWalker walker = new ParseTreeWalker();
                MyXMLListener listener = new MyXMLListener();
                walker.walk(listener, ctx);
                return offsets;
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }
    
        public static void main(String[] arg) {
            System.out.println("Search for offsets.");
            List<Integer> offsets = new FindXmlOffset().createOffsets("/tmp/dewiki-20170501-pages-articles-multistream.xml",
                            "page");
            System.out.println("Offsets: " + offsets);
        }
    
    }
    

    5. Result

    Prints:

    Offsets: [2441, 10854, 30257, 51419 ....

    6. Read from Offset Position

    To test the code, I've written a class that reads each Wikipedia page into a Java object

    @JacksonXmlRootElement
    class Page {
        public Page() {}
        public String title;
    }
    

    using basically this code

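    private Page readPage(Integer offset, String filename) {
        try (Reader in = new FileReader(filename)) {
            // Reader.skip may skip fewer chars than requested, so loop until the offset is reached.
            long remaining = offset;
            while (remaining > 0) {
                long skipped = in.skip(remaining);
                if (skipped <= 0) {
                    throw new IllegalArgumentException("Offset " + offset + " is past the end of " + filename);
                }
                remaining -= skipped;
            }
            ObjectMapper mapper = new XmlMapper();
            // Page only declares title, so ignore every other child element.
            mapper.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false);
            return mapper.readValue(in, Page.class);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }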
    

    The complete example is on GitHub.
