How to transform huge XML files in Java?

失恋的感觉 2021-01-12 07:04

As the title says, I have a huge XML file (several GB in size).

  
  
     ...    
    ...          


        
7 answers
  •  长情又很酷
    2021-01-12 07:07

    I have had good experiences with STX (Streaming Transformations for XML). Basically, it is a streaming counterpart to XSLT, well suited to processing huge amounts of data with a minimal memory footprint. It has a Java implementation named Joost.

    It should be easy to come up with an STX transform that ignores all elements until one matches a given XPath, copies that element and all its children (using an identity template within a template group), and then goes back to ignoring elements until the next match.
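
For comparison, the same ignore-until-match, copy-the-subtree logic can be sketched in plain Java with the JDK's StAX API, in case adding Joost is not an option. This is not the STX approach itself, only a minimal illustration of the streaming idea. The class name `SubtreeFilter`, its method names, and the `matches` wrapper element are all hypothetical, and the match rule is simplified to "an element named `child` whose direct parent is named `parent`" rather than a real XPath.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper: streams through the input and copies only the
// subtrees rooted at <child> elements that sit directly under <parent>.
class SubtreeFilter {

    static void copyMatches(InputStream in, OutputStream out,
                            String parent, String child) throws XMLStreamException {
        XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(in);
        XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(out, "UTF-8");
        XMLEventFactory events = XMLEventFactory.newInstance();

        // Assumed wrapper element so the output has a single root.
        writer.add(events.createStartElement("", "", "matches"));

        Deque<String> path = new ArrayDeque<>(); // names of the currently open elements
        int copyDepth = 0;                       // > 0 while inside a matched subtree

        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (event.isStartElement()) {
                String name = event.asStartElement().getName().getLocalPart();
                if (copyDepth > 0) {
                    copyDepth++;                 // nested element inside a match
                } else if (name.equals(child) && parent.equals(path.peek())) {
                    copyDepth = 1;               // a new match starts here
                }
                path.push(name);
                if (copyDepth > 0) {
                    writer.add(event);
                }
            } else if (event.isEndElement()) {
                path.pop();
                if (copyDepth > 0) {
                    writer.add(event);
                    copyDepth--;                 // back to ignoring once it reaches 0
                }
            } else if (copyDepth > 0) {
                writer.add(event);               // text, comments etc. inside a match
            }
            // Everything outside a matched subtree is dropped.
        }

        writer.add(events.createEndElement("", "", "matches"));
        writer.close();
        reader.close();
    }

    // Convenience wrapper for small in-memory inputs (e.g. for trying it out).
    static String filter(String xml, String parent, String child) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            copyMatches(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                        out, parent, child);
        } catch (XMLStreamException e) {
            throw new IllegalStateException("XML processing failed", e);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

For a multi-gigabyte file, `copyMatches` would be called with buffered file streams, e.g. `copyMatches(new BufferedInputStream(new FileInputStream(src)), new BufferedOutputStream(new FileOutputStream(dst)), "element", "child")`. Only the stack of open element names is held in memory, so the footprint grows with nesting depth rather than file size.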

    UPDATE

    I hacked together an STX transform that does what I understand you want. It relies mostly on STX-only features such as template groups and configurable default templates.

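    <stx:transform xmlns:stx="http://stx.sourceforge.net/2002/ns" version="1.0" pass-through="none">
        <stx:template match="element/child">
            <stx:process-self group="copy" />
        </stx:template>
        <stx:group name="copy" pass-through="all">
        </stx:group>
    </stx:transform>
    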

    The pass-through="none" on stx:transform configures the default templates (for nodes, attributes, etc.) to produce no output but still process child elements. The stx:template then matches the XPath element/child (this is where you put your own match expression) and "processes self" in the "copy" group, meaning that the matching template from the group with name="copy" is invoked on the current element. That group has pass-through="all", so its default templates copy their input and process child elements. When the element/child element ends, control passes back to the template that invoked process-self, and the following elements are ignored again until the template matches the next time.

    The following is an example input file:

    
        
        
        
            
                text1bold
            
        
        
            
                text2
                
                
                    yet more
                
            
        
    
    

    This is the corresponding output file:

    
    
                text1bold
            
                text2
                
                
                    yet more
                
            
    

    The unusual formatting is a result of skipping the text nodes containing newlines outside the child elements.
