Using StAX to create index for XML for quick access

孤街浪徒 2020-12-01 15:26

Is there a way to use StAX and JAXB to build an index of an XML file and then use that index for quick access?

I have a large XML file and I need to find information in it. This

2 Answers
  •  予麋鹿 (original poster)
    2020-12-01 16:08

    You could use an XML parser generated with ANTLR4.

    The following works very well on a ~17GB Wikipedia dump (/20170501/dewiki-20170501-pages-articles-multistream.xml.bz2), but I had to increase the JVM heap size to 6 GB (the standard flag for that is -Xmx6g).

    1. Get XML Grammar

    cd /tmp
    git clone https://github.com/antlr/grammars-v4
    

    2. Generate Parser

    cd /tmp/grammars-v4/xml/
    mvn clean install
    

    3. Copy Generated Java files to your Project

    cp -r target/generated-sources/antlr4 /path/to/your/project/gen
    

    4. Hook in with a Listener to collect character offsets

    package stack43366566;
    
    import java.util.ArrayList;
    import java.util.List;
    
    import org.antlr.v4.runtime.ANTLRFileStream;
    import org.antlr.v4.runtime.CommonTokenStream;
    import org.antlr.v4.runtime.tree.ParseTreeWalker;
    
    import stack43366566.gen.XMLLexer;
    import stack43366566.gen.XMLParser;
    import stack43366566.gen.XMLParser.DocumentContext;
    import stack43366566.gen.XMLParserBaseListener;
    
    public class FindXmlOffset {
    
        // Character offsets of every matching element, in document order
        private List<Integer> offsets = null;
        private String searchForElement = null;
    
        public class MyXMLListener extends XMLParserBaseListener {
            @Override
            public void enterElement(XMLParser.ElementContext ctx) {
                // The first Name token of an element is its tag name
                String name = ctx.Name().get(0).getText();
                if (searchForElement.equals(name)) {
                    // Character index of the element's first token (the opening '<')
                    offsets.add(ctx.start.getStartIndex());
                }
            }
        }
    
        public List<Integer> createOffsets(String file, String elementName) {
            searchForElement = elementName;
            offsets = new ArrayList<>();
            try {
                // ANTLRFileStream reads the whole file into memory and the parser
                // builds a full parse tree, which is why the heap has to be large
                XMLLexer lexer = new XMLLexer(new ANTLRFileStream(file));
                CommonTokenStream tokens = new CommonTokenStream(lexer);
                XMLParser parser = new XMLParser(tokens);
                DocumentContext ctx = parser.document();
                // Walk the tree; the listener records an offset for each match
                ParseTreeWalker walker = new ParseTreeWalker();
                MyXMLListener listener = new MyXMLListener();
                walker.walk(listener, ctx);
                return offsets;
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }
    
        public static void main(String[] arg) {
            System.out.println("Search for offsets.");
            List<Integer> offsets = new FindXmlOffset().createOffsets("/tmp/dewiki-20170501-pages-articles-multistream.xml",
                            "page");
            System.out.println("Offsets: " + offsets);
        }
    
    }
    

    5. Result

    Prints:

    Offsets: [2441, 10854, 30257, 51419 ....
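
Since the point of the exercise is an index you can reuse, you probably want to store the offsets rather than re-parse the dump every time. Here is a minimal sketch of that, assuming a plain text file with one offset per line is good enough. The class OffsetIndexFile and its save/load methods are names I made up for illustration; they are not part of the code above.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not from the original answer): keeps the offset list
// in a text file so the expensive parse only has to run once.
class OffsetIndexFile {

    // Write the offsets as decimal text, one per line.
    static void save(Path indexFile, List<Integer> offsets) throws IOException {
        List<String> lines = new ArrayList<>();
        for (Integer offset : offsets) {
            lines.add(Integer.toString(offset));
        }
        Files.write(indexFile, lines, StandardCharsets.UTF_8);
    }

    // Read the offsets back in the order they were written.
    static List<Integer> load(Path indexFile) throws IOException {
        List<Integer> offsets = new ArrayList<>();
        for (String line : Files.readAllLines(indexFile, StandardCharsets.UTF_8)) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                offsets.add(Integer.parseInt(trimmed));
            }
        }
        return offsets;
    }

    public static void main(String[] args) throws IOException {
        Path indexFile = Files.createTempFile("page-offsets", ".idx");
        // The first four offsets from the result above, used as sample data
        save(indexFile, Arrays.asList(2441, 10854, 30257, 51419));
        System.out.println("Offsets: " + load(indexFile));
        Files.delete(indexFile);
    }
}
```

The index is only valid for the exact file it was built from, so it has to be rebuilt whenever the XML changes.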

    6. Read from Offset Position

    To test the code, I wrote a class that reads each Wikipedia page into a Java object

    @JacksonXmlRootElement
    class Page {
        public Page() {}
        public String title;   // filled from the <title> child element
    }
    

    using essentially this code

    private Page readPage(Integer offset, String filename) {
        try (Reader in = new FileReader(filename)) {
            // skip() counts characters, matching the character offsets from the listener
            in.skip(offset);
            ObjectMapper mapper = new XmlMapper();
            // Ignore child elements of <page> that Page does not declare
            mapper.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false);
            Page object = mapper.readValue(in, Page.class);
            return object;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
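
One thing to watch with readPage: the offsets are character indexes into the decoded text, not byte positions, so they only line up if the file is read through a Reader with the same charset that was used while indexing. Seeking to the same number as a byte position would land in the wrong place as soon as the file contains a multi-byte character. Below is a small sketch of that difference using only the JDK. CharOffsetReader and readChars are hypothetical names for illustration, and it assumes UTF-8 input.

```java
import java.io.BufferedReader;
import java.io.EOFException;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper (not from the original answer).
class CharOffsetReader {

    // Return up to `length` characters starting at character offset `charOffset`.
    static String readChars(Path file, long charOffset, int length, Charset charset) throws IOException {
        try (BufferedReader in = Files.newBufferedReader(file, charset)) {
            // skip() may stop early, so keep going until the offset is reached
            long remaining = charOffset;
            while (remaining > 0) {
                long skipped = in.skip(remaining);
                if (skipped <= 0) {
                    throw new EOFException("offset " + charOffset + " is past the end of " + file);
                }
                remaining -= skipped;
            }
            char[] buffer = new char[length];
            int filled = 0;
            while (filled < length) {
                int n = in.read(buffer, filled, length - filled);
                if (n < 0) {
                    break; // end of file before `length` characters were read
                }
                filled += n;
            }
            return new String(buffer, 0, filled);
        }
    }

    public static void main(String[] args) throws IOException {
        // The first page contains 'ä', which takes two bytes in UTF-8
        String xml = "<r><page>\u00e4</page><page>b</page></r>";
        Path file = Files.createTempFile("offset-demo", ".xml");
        Files.write(file, xml.getBytes(StandardCharsets.UTF_8));

        int charOffset = xml.indexOf("<page>", 4); // start of the second <page>
        int byteOffset = xml.substring(0, charOffset).getBytes(StandardCharsets.UTF_8).length;
        System.out.println("char offset " + charOffset + ", byte offset " + byteOffset);
        System.out.println(readChars(file, charOffset, 6, StandardCharsets.UTF_8));
        Files.delete(file);
    }
}
```

Running it prints "char offset 17, byte offset 18" followed by "<page>". Note that skipping characters still means decoding everything before the offset, so this is a sequential read up to that point rather than a true random-access seek.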
    

    A complete example is available on GitHub.
