Shell scripting - split xml into multiple files

I am trying to split a big XML file into multiple files, and have used the following code in an AWK script.

// {
        rfile="fileItem" count


                      
3 Answers

再見小時候, 2020-12-21 22:18

First and foremost - you need a parser for this.

XML is a contextual data format. Regular expressions are not. So a processing system based on regular expressions can never be made to work reliably on XML.

It's just bad news.

But parsers do exist, and they're quite easy to work with. I could give you a better example with better sample input, but I would use XML::Twig and Perl to do this:

#!/usr/bin/env perl
use strict;
use warnings;

use XML::Twig;


#subroutine to extract and process the item
sub save_item {
   my ( $twig, $item ) = @_;
   #retrieve the id
   my $id = $item -> first_child_text('id'); 
   print "Got ID of $id\n";

   #create a new XML document for output. 
   my $new_xml = XML::Twig -> new;
   $new_xml -> set_root (XML::Twig::Elt -> new ( 'root' ));

   #cut and paste the item from the 'old' doc into the 'new'  
   #note - "cut" applies to in memory, 
   #not the 'on disk' copy. 
   $item -> cut;
   $item -> paste ( $new_xml -> root );

   #set XML params (not strictly needed but good style)
   $new_xml -> set_encoding ('utf-8');
   $new_xml -> set_xml_version ('1.0');

   #set output formatting
   $new_xml -> set_pretty_print('indented_a');

   print "Generated new XML:\n";
   $new_xml -> print;

   #open a file for output
   open ( my $output, '>', "item_$id.xml" ) or die "Cannot open item_$id.xml: $!";
   print {$output} $new_xml->sprint;
   close ( $output ); 
}

#create a parser. 
my $twig = XML::Twig -> new ( twig_handlers => { 'fileItem' => \&save_item } );
#run this parser on the __DATA__ filehandle below.
#you probably want parsefile('some_file.xml') instead. 
$twig -> parse ( \*DATA );


__DATA__
<xml>
<fileItem>
<id>12345</id>
<name>XXXXX</name>
</fileItem>
</xml>


XML::Twig also ships with the xml_split command-line tool, which may suit your needs.
                                                                        
                                                        
            
            
              
                
青春惊慌失措, 2020-12-21 22:20

If your XML really is that well-formed and consistent, then all you need is:

awk -F'[<>]' '
/<fileItem>/   { header = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>" ORS $0; next }
/<id>/         { close(out); out = "item_" $3; $0 = header ORS $0 }
out != ""      { print > out }        # skip lines outside any item
/<\/fileItem>/ { close(out); out = "" }
' file


The above is untested of course since you didn't provide sample input/output for us to test a possible solution against.
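
For a quick check, the one-liner can be run against made-up input. The following is only a sketch under stated assumptions: the sample document and the file name sample.xml are invented for illustration, the /<id>/ pattern is written with its closing slash, and a guard skips lines that fall outside any item. Run it in an empty scratch directory.

```shell
# Sample input: an assumption, since the question did not include any.
cat > sample.xml <<'EOF'
<xml>
<fileItem>
<id>12345</id>
<name>XXXXX</name>
</fileItem>
<fileItem>
<id>67890</id>
<name>YYYYY</name>
</fileItem>
</xml>
EOF

awk -F'[<>]' '
/<fileItem>/   { header = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>" ORS $0; next }
/<id>/         { close(out); out = "item_" $3; $0 = header ORS $0 }
out != ""      { print > out }        # skip lines outside any item
/<\/fileItem>/ { close(out); out = "" }
' sample.xml

# Show which files were written and what the first one contains.
ls item_*
cat item_12345
```

Each item ends up in its own file named after its id, starting with the XML declaration. This still relies on every tag sitting on its own line and on <id> being the first child of <fileItem>.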
                                                                        
                                                        
            
            
              
                
半阙折子戏, 2020-12-21 22:23

I would not use getline. (I even read in an AWK book that using it is not recommended.) I think keeping state in global variables is even simpler. (Expressions with global variables may be used in patterns, too.)
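
The idea of keeping state in a global variable and using that variable directly as a pattern can be sketched in isolation. In this minimal example the flag name inside, the input lines, and the file names are all made up for illustration; only the lines between the two markers are printed.

```shell
# Made-up input: two lines outside the markers, three lines inside.
printf '%s\n' 'before' '<fileItem>' '<id>1</id>' '</fileItem>' 'after' > state-demo.txt

# awk variables are global and start out empty, which counts as false.
awk '
/<fileItem>/   { inside = 1 }    # opening marker: start collecting
inside         { print }         # a bare variable works as a pattern
/<\/fileItem>/ { inside = 0 }    # closing marker: stop collecting
' state-demo.txt > state-demo.out

cat state-demo.out
```

The rule order matters here: the flag is set before the print rule and cleared after it, so both marker lines are included in the output.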

The script could look like this:

test-split-xml.awk:

/<fileItem>/ {
  collect = 1 ; buffer = "" ; file = "fileItem_"count".xml"
  ++count
}

collect > 0 {
  if (buffer != "") buffer = buffer"\n"
  buffer = buffer $0
}

collect > 0 && /<name>.+<\/name>/ {
  # cut "...<name>"
  i = index($0, "<name>") ; file = substr($0, i + 6)
  # cut "</name>..."
  i = index(file, "</name>") ; file = substr(file, 1, i - 1)
  file = file".xml"
}

/<\/fileItem>/ {
  collect = 0;
  print file
  print "<?xml version=\"1.0\" encoding=\"UTF-8\"?>" >file
  print buffer >file
  close(file)  # release the handle so a big input does not keep every output file open
}


I prepared some sample data for a small test:

test-split-xml.xml:

<?xml version="1.0" encoding="UTF-8"?>
<top>
  <some>
    <fileItem>
      <id>1</id>
      <name>X1</name>
    </fileItem>
  </some>
  <fileItem>
    <id>2</id>
    <name>X2</name>
  </fileItem>
  <fileItem>
    <id>2</id>
    <!--name>X2</name-->
  </fileItem>
  <any> other input </any>
</top>


... and got the following output:

$ awk -f test-split-xml.awk test-split-xml.xml
X1.xml
X2.xml
fileItem_2.xml

$ more X1.xml 
<?xml version="1.0" encoding="UTF-8"?>
    <fileItem>
      <id>1</id>
      <name>X1</name>
    </fileItem>

$ more X2.xml
<?xml version="1.0" encoding="UTF-8"?>
  <fileItem>
    <id>2</id>
    <name>X2</name>
  </fileItem>

$ more fileItem_2.xml 
<?xml version="1.0" encoding="UTF-8"?>
  <fileItem>
    <id>2</id>
    <!--name>X2</name-->
  </fileItem>

$


tripleee's comment is reasonable. Such processing should therefore be limited to personal use, because different (and perfectly legal) formatting of the XML file could cause this script to fail.
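
To make that caveat concrete, here is a small sketch with made-up input (the file names and the lang attribute are invented for illustration). It writes the same kind of items in three legal layouts and counts the lines on which a line-oriented /<fileItem>/ pattern fires.

```shell
# Layout 1: one tag per line, which is what line-based awk scripts expect.
printf '%s\n' '<top>' '<fileItem>' '<id>1</id>' '</fileItem>' \
              '<fileItem>' '<id>2</id>' '</fileItem>' '</top>' > multi-line.xml

# Layout 2: the same two items on a single line, which is equally valid XML.
printf '%s\n' '<top><fileItem><id>1</id></fileItem><fileItem><id>2</id></fileItem></top>' > single-line.xml

# Layout 3: an attribute on the opening tag (hypothetical attribute name).
printf '%s\n' '<top>' '<fileItem lang="en">' '<id>3</id>' '</fileItem>' '</top>' > with-attribute.xml

# Count the lines on which the pattern matches in each file.
for f in multi-line.xml single-line.xml with-attribute.xml; do
    n=$(awk '/<fileItem>/ { n++ } END { print n + 0 }' "$f")
    echo "$f: $n"
done > layout-demo.out

cat layout-demo.out
```

The counts come out as 2, 1 and 0: the single-line layout hides the second item, and the attribute hides the item entirely. A parser treats all three layouts the same way.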

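As you will notice, there is no next anywhere in test-split-xml.awk. This is intentional: a single input line may have to trigger several rules. For example, the line containing <fileItem> both switches collecting on and is itself appended to the buffer.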
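
The effect of next can be seen in a two-rule sketch (the rule bodies, the "buffered:" prefix, and the file names are illustrative and not part of the script above). awk tests every rule against each line, and next abandons the remaining rules for the current line.

```shell
printf '%s\n' '<fileItem>' '<id>1</id>' > next-demo.txt

# Without next: the opening-tag line matches the first rule and then the second.
awk '/<fileItem>/ { collect = 1 }
     collect      { print "buffered: " $0 }' next-demo.txt > without-next.out

# With next: the opening-tag line never reaches the second rule, so it is lost.
awk '/<fileItem>/ { collect = 1; next }
     collect      { print "buffered: " $0 }' next-demo.txt > with-next.out

cat without-next.out
echo ---
cat with-next.out
```

The first run reports both lines as buffered, while the second run reports only the <id> line.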
                                                                        
                                                        
            
            
              
                