How to Parse Big (50 GB) XML Files in Java

-上瘾入骨i 2020-12-07 21:16

Currently I'm trying to use a SAX parser, but about three quarters of the way through the file it freezes completely. I have tried allocating more memory, among other things, but I'm not seeing any improvement.

2 Answers
  •  粉色の甜心
    2020-12-07 21:51

    Don Roby's approach is somewhat reminiscent of the approach I followed when creating a code generator designed to solve this particular problem (an early version was conceived in 2008). Basically, each complexType has an equivalent Java POJO, and the handlers for a particular type are activated when the context changes to that element. I have used this approach for SEPA, transaction banking and, for instance, Discogs (30 GB). You can specify which elements you want to process at runtime, declaratively, using a properties file.

    XML2J maps complexTypes to Java POJOs, and it also lets you specify which events you want to listen for. For example:

    account/@process = true
    account/accounts/@process = true
    account/accounts/@detach = true
    

    The essence is in the third line: detach ensures that individual accounts are not added to the accounts list, so the list cannot grow without bound and exhaust memory.

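    import java.util.ArrayList;
    import java.util.List;

    class AccountType {
        // child accounts; with detach = true, addAccount is never called for them
        private final List<AccountType> accounts = new ArrayList<>();

        public void addAccount(AccountType tAccount) {
            accounts.add(tAccount);
        }
        // etc.
    }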
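
    The same idea can be sketched without XML2J, using the StAX API that ships with the JDK. This is only an illustration of "process each element, then let it go", not how XML2J implements detach; the Account class, the element name account and the id attribute are hypothetical.

```java
import java.io.StringReader;
import java.util.function.Consumer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class DetachSketch {

    /** Hypothetical stand-in for a generated POJO; not an XML2J class. */
    public static final class Account {
        public final String id;
        public final String owner;

        Account(String id, String owner) {
            this.id = id;
            this.owner = owner;
        }
    }

    /**
     * Hands each <account> to the handler when its closing tag is read.
     * Nothing is collected in a list, so memory use does not grow with
     * the number of accounts in the input.
     */
    public static int streamAccounts(XMLStreamReader reader, Consumer<Account> handler)
            throws XMLStreamException {
        int count = 0;
        boolean inAccount = false;
        String id = null;
        StringBuilder text = new StringBuilder();
        while (reader.hasNext()) {
            switch (reader.next()) {
                case XMLStreamConstants.START_ELEMENT:
                    if ("account".equals(reader.getLocalName())) {
                        inAccount = true;
                        id = reader.getAttributeValue(null, "id");
                        text.setLength(0);
                    }
                    break;
                case XMLStreamConstants.CHARACTERS:
                    if (inAccount) {
                        text.append(reader.getText());
                    }
                    break;
                case XMLStreamConstants.END_ELEMENT:
                    if ("account".equals(reader.getLocalName())) {
                        // The element is complete: process it and keep no reference to it.
                        handler.accept(new Account(id, text.toString().trim()));
                        inAccount = false;
                        count++;
                    }
                    break;
                default:
                    break;
            }
        }
        return count;
    }

    /** Convenience overload for small in-memory demos; wraps the checked exception. */
    public static int streamAccounts(String xml, Consumer<Account> handler) {
        try {
            XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            try {
                return streamAccounts(reader, handler);
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<accounts><account id=\"1\">Ann</account>"
                   + "<account id=\"2\">Bob</account></accounts>";
        int n = streamAccounts(xml, a -> System.out.println(a.id + " " + a.owner));
        System.out.println(n + " accounts processed");
    }
}
```

    For a real 50 GB file you would create the XMLStreamReader from a buffered FileInputStream instead of a StringReader and do the database write inside the handler.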
    

    In your code you need to implement the process method (by default the code generator generates an empty method):

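    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;
    import org.springframework.context.support.ClassPathXmlApplicationContext;
    import org.springframework.core.io.ClassPathResource;
    // MessageProcessor, XMLEvent, ComplexDataType and ProcessorException are XML2J types (imports omitted)

    class AccountsProcessor implements MessageProcessor {
        private static final Logger logger = LoggerFactory.getLogger(AccountsProcessor.class);

        // assuming Spring Data persistence here
        private final String path = new ClassPathResource("spring-config.xml").getPath();
        private final ClassPathXmlApplicationContext context = new ClassPathXmlApplicationContext(path);
        private final AccountsTypeRepo repo = context.getBean(AccountsTypeRepo.class);

        @Override
        public void process(XMLEvent evt, ComplexDataType data)
            throws ProcessorException {

            // END fires on the closing tag, so the element is fully populated here
            if (evt == XMLEvent.END && data instanceof AccountType) {
                process((AccountType) data);
            }
        }

        private void process(AccountType data) {
            if (logger.isInfoEnabled()) {
                // do some logging
            }
            repo.save(data);
        }
    }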
    

    Note that XMLEvent.END marks the closing tag of an element, so by the time you process it the element is complete. If you have to relate it (using a foreign key) to its parent object in the database, you could process XMLEvent.BEGIN for the parent, create a placeholder row in the database, and store its key with each of its children. On the parent's final XMLEvent.END you would then update the parent.
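
    A minimal sketch of that placeholder pattern, again using plain StAX rather than XML2J. The FakeDb class is a hypothetical in-memory stand-in for real tables, and the element names customer and account are made up for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class ParentKeySketch {

    /** Hypothetical in-memory stand-in for a parent table and a child table. */
    public static final class FakeDb {
        public final Map<Long, String> parents = new LinkedHashMap<>(); // key -> status
        public final List<String> children = new ArrayList<>();         // "parentKey:childId"
        private long nextKey = 1;

        long insertParentPlaceholder() {
            long key = nextKey++;
            parents.put(key, "PENDING");
            return key;
        }

        void insertChild(long parentKey, String childId) {
            children.add(parentKey + ":" + childId);
        }

        void completeParent(long key, int childCount) {
            parents.put(key, "COMPLETE(" + childCount + ")");
        }
    }

    public static FakeDb load(String xml) {
        FakeDb db = new FakeDb();
        try {
            XMLStreamReader r =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            long parentKey = -1;
            int childCount = 0;
            String childId = null;
            while (r.hasNext()) {
                int event = r.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    if ("customer".equals(r.getLocalName())) {
                        // Parent BEGIN: create the placeholder and remember its key.
                        parentKey = db.insertParentPlaceholder();
                        childCount = 0;
                    } else if ("account".equals(r.getLocalName())) {
                        // Attributes are only readable on the start tag.
                        childId = r.getAttributeValue(null, "id");
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    if ("account".equals(r.getLocalName())) {
                        // Child END: the child is complete, store it with the parent's key.
                        db.insertChild(parentKey, childId);
                        childCount++;
                    } else if ("customer".equals(r.getLocalName())) {
                        // Parent END: update the placeholder row.
                        db.completeParent(parentKey, childCount);
                    }
                }
            }
            r.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
        return db;
    }

    public static void main(String[] args) {
        FakeDb db = load("<customers><customer><account id=\"a1\"/><account id=\"a2\"/></customer></customers>");
        System.out.println(db.parents + " " + db.children);
    }
}
```

    Only the current parent key and a counter are held in memory, so the approach scales with the depth of the document rather than its size.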

    The code generator generates everything else you need. You only have to implement that method and, of course, the database glue code.

    There are samples to get you started. The code generator even generates your POM files, so you can build your project immediately after generation.

    The default generated process method looks like this:

    @Override
    public void process(XMLEvent evt, ComplexDataType data)
        throws ProcessorException {

        /*
         *  TODO Auto-generated method stub implement your own handling here.
         *  Use the runtime configuration file to determine which events are to be sent to the processor.
         */
        if (evt == XMLEvent.END) {
            data.print(ConsoleWriter.out);
        }
    }
    

    Downloads:

    • https://github.com/lolkedijkstra/xml2j-core
    • https://github.com/lolkedijkstra/xml2j-gen
    • https://sourceforge.net/projects/xml2j/

    First run mvn clean install on the core (it has to be in the local Maven repository), then on the generator. Don't forget to set the XML2J_HOME environment variable as described in the user manual.
