How can I read multiple lines of a file into blocks in Perl?


悲哀的现实 2021-01-29 05:57

I have a file which contains the text below.

#L_ENTRY    
#LEX        
#ROOT       
#POS        
#SUBCAT


      
      
        
3 answers

轮回少年 (original poster) 2021-01-29 06:22
              

            
            
                        
There are two ways to do it.  First, you can set the "input record separator" special variable, $/ (documented in perlvar).  In short, you are telling Perl that a "line" is terminated not by a newline character but by a string of your choosing.  In your case, you could set it to the '#SYNONYM <0>' line.  Each read then returns everything up to and including that tag; if the tag does not occur again, you get whatever is left in the file.  So, for input data that looks like this:

#L_ENTRY        
#LEX         
#ROOT        
#POS         
#SUBCAT      
#S_LINK            <>
#BITS     <>
#WEIGHT      <0.1>
#SYNONYM     <0>

#L_ENTRY        
#LEX         <,>
#ROOT        <,>
#POS         
#SUBCAT      
#S_LINK            <>
#BITS     <>
#WEIGHT      <0.1>
#SYNONYM     <0>


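if you run this:

use v5.14;
use warnings;

my $filename = "data.txt" ;
open(my $fh, '<', $filename) or die "$filename: $!" ;

# Each "line" now ends at the #SYNONYM tag instead of at a newline.
# The spacing in the separator must match the file exactly.
local $/ = "#SYNONYM     <0>\n" ;
my @chunks = <$fh> ;    # list context: read every record at once
close $fh ;

say $chunks[0] ;
say '---' ;
say $chunks[1] ;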


You get:

#L_ENTRY        
#LEX         
#ROOT        
#POS         
#SUBCAT      
#S_LINK            <>
#BITS     <>
#WEIGHT      <0.1>
#SYNONYM     <0>

---

#L_ENTRY        
#LEX         <,>
#ROOT        <,>
#POS         
#SUBCAT      
#S_LINK            <>
#BITS     <>
#WEIGHT      <0.1>
#SYNONYM     <0>


A couple of notes about this:


Any extra data between your records "gets caught in the net" and ends up at the start of the record that follows it.
The record separator itself is still part of the data and sits at the end of each record; chomp removes it, because chomp strips the current value of $/.
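
Both caveats can be handled with a small wrapper. The sketch below is one possible approach, not the only one: it localizes $/ inside a sub so the change does not leak into the rest of the program, uses chomp to strip the separator from each record, and trims the blank lines left over between records. The sub name read_records and the in-memory sample data are made up for illustration.

```perl
use v5.14;
use warnings;

# Hypothetical helper: read records terminated by $separator from $fh.
sub read_records {
    my ($fh, $separator) = @_;
    local $/ = $separator;      # restored automatically when the sub returns
    my @records;
    while (my $record = <$fh>) {
        chomp $record;          # chomp removes the current value of $/
        $record =~ s/\A\s+//;   # drop blank lines left over between records
        push @records, $record if length $record;
    }
    return @records;
}

# Sample data held in a string so the sketch runs without data.txt.
my $data = "#LEX <a>\n#SYNONYM <0>\n\n#LEX <b>\n#SYNONYM <0>\n";
open(my $fh, '<', \$data) or die "in-memory file: $!";
my @records = read_records($fh, "#SYNONYM <0>\n");
close $fh;

say scalar @records;    # number of records found
print $records[1];
```

Note that chomp drops the #SYNONYM line from each record. If that line is data you need, skip the chomp or append the separator again after trimming.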


To get more control, it's better to process the data line by line and use regexes to switch between "capture" mode and "don't capture" mode:

use v5.14;
use warnings;

my $filename = "data.txt" ;
open(my $fh, '<', $filename) or die "$filename: $!" ;

# /x lets the patterns contain whitespace; a literal '#' must be escaped.
my $found_start_token = qr/ \s* \#L_ENTRY \s* /x;
my $found_stop_token  = qr/ \s* \#SYNONYM \s+ \<0\> \s* \n /x;

my @chunks ;
my $chunk = '' ;
my $capture_mode = 0 ;

while ( <$fh> )  {
    # Start collecting lines when a record begins.
    $capture_mode = 1 if /$found_start_token/ ;
    $chunk .= $_ if $capture_mode ;
    # The stop line is appended above, then the finished record is stored.
    if (/$found_stop_token/) {
        push @chunks, $chunk ;
        $chunk = '' ;
        $capture_mode = 0 ;
    }
}
close $fh ;

say $chunks[0] ;
say '---' ;
say $chunks[1] ;
exit 0 ;


A couple of notes:


The program works by concatenating the current line, $_, onto $chunk whenever we're in capture mode.
Capture mode is turned on and off using regexes in "extended mode", /x.  This allows adding whitespace to the regex for easier reading.
Extra data between records will not appear in the chunks.
The output is the same as before, except that the blank line between the two records is no longer captured at the start of the second chunk.
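
The point about extra data between records can be checked with a small experiment. The sketch below runs the same capture/don't-capture loop over an in-memory string that has a stray line between two records. The sub name extract_chunks and the sample data are illustrative assumptions, and the patterns are anchored with ^ and \z here, which the code above does not do.

```perl
use v5.14;
use warnings;

# Hypothetical helper wrapping the capture-mode loop.
sub extract_chunks {
    my ($fh) = @_;
    my $start = qr/ ^ \s* \#L_ENTRY \b /x;
    my $stop  = qr/ ^ \s* \#SYNONYM \s+ <0> \s* \z /x;

    my (@chunks, $chunk);
    my $capturing = 0;
    while (my $line = <$fh>) {
        if (!$capturing && $line =~ $start) {
            $capturing = 1;     # a new record begins on this line
            $chunk = '';
        }
        next unless $capturing; # ignore anything outside a record
        $chunk .= $line;
        if ($line =~ $stop) {
            push @chunks, $chunk;
            $capturing = 0;     # record complete, stop collecting
        }
    }
    return @chunks;
}

# Sample data with a stray line between the two records.
my $data = join '',
    "#L_ENTRY <one>\n", "#SYNONYM <0>\n",
    "stray line between records\n",
    "#L_ENTRY <two>\n", "#SYNONYM <0>\n";
open(my $fh, '<', \$data) or die "in-memory file: $!";
my @chunks = extract_chunks($fh);
close $fh;

say scalar @chunks;     # number of chunks captured
print $chunks[1];
```

Because the stray line arrives while capturing is off and does not match the start pattern, it is skipped and never reaches any chunk.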

    
             
                                                        
            
            
              
                