I have a file which contains the text below.
#L_ENTRY
#LEX >
#ROOT >
#POS
#SUBCAT
There are two ways to do it. Firstly, you can set the "input record separator" special variable, $/ (see more here). In short, you are telling perl that a record is terminated not by a newline character but by a string of your choosing. In your case, you could set it to "#SYNONYM <0>" followed by a newline. Then each read returns everything up to and including the next occurrence of that tag; if the tag does not occur again, you get whatever is left in the file. So, for input data that looks like this:
#L_ENTRY
#LEX >
#ROOT >
#POS
#SUBCAT
#S_LINK <>
#BITS <>
#WEIGHT <0.1>
#SYNONYM <0>
#L_ENTRY
#LEX <,>
#ROOT <,>
#POS
#SUBCAT
#S_LINK <>
#BITS <>
#WEIGHT <0.1>
#SYNONYM <0>
if you run this:
use v5.14;
use warnings;
my $filename = "data.txt" ;
open(my $fh, '<', $filename) or die "$filename: $!" ;
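# Records now end at the "#SYNONYM <0>" line rather than at every newline.
local $/ = "#SYNONYM <0>\n" ;
# In list context, <$fh> returns all remaining records at once.
my @chunks = <$fh> ;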
say $chunks[0] ;
say '---' ;
say $chunks[1] ;
you get:
#L_ENTRY
#LEX >
#ROOT >
#POS
#SUBCAT
#S_LINK <>
#BITS <>
#WEIGHT <0.1>
#SYNONYM <0>
---
#L_ENTRY
#LEX <,>
#ROOT <,>
#POS
#SUBCAT
#S_LINK <>
#BITS <>
#WEIGHT <0.1>
#SYNONYM <0>
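As a variation on the above, here is a minimal sketch that reads one record at a time instead of loading every chunk into an array, and strips the terminator with chomp. The helper name read_records and the in-memory sample data are my own assumptions for illustration; they are not part of your file.

```perl
use v5.14;
use warnings;

# Hypothetical helper: return the records in $fh that end with $terminator.
sub read_records {
    my ($fh, $terminator) = @_;
    local $/ = $terminator;    # restored automatically when the sub returns
    my @records;
    while ( my $record = <$fh> ) {
        chomp $record;                  # chomp removes the current value of $/
        next unless $record =~ /\S/;    # skip a whitespace-only remainder
        push @records, $record;
    }
    return @records;
}

# Made-up sample data held in a string, so the sketch runs without data.txt.
my $data = <<'END';
#L_ENTRY
#LEX <,>
#WEIGHT <0.1>
#SYNONYM <0>
#L_ENTRY
#LEX <.>
#WEIGHT <0.2>
#SYNONYM <0>
END

# Opening a reference to a scalar gives an in-memory filehandle.
open(my $fh, '<', \$data) or die "in-memory file: $!";
my @records = read_records($fh, "#SYNONYM <0>\n");
close $fh;

say scalar(@records), " records";
print join("---\n", @records);
```

Because $/ is localised inside the sub, the rest of the program still reads ordinary newline-terminated lines.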
A couple of notes about this;
To get more control, it's better to process the data line by line and use regexes to switch between "capture" mode and "don't capture" mode:
use v5.14;
use warnings;
my $filename = "data.txt" ;
open(my $fh, '<', $filename) or die "$filename: $!" ;
my $found_start_token = qr/ \s* \#L_ENTRY \s* /x;
my $found_stop_token = qr/ \s* \#SYNONYM \s+ \<0\> \s* \n /x;
my @chunks ;
my $chunk ;
my $capture_mode = 0 ;
while ( <$fh> ) {
    $capture_mode = 1 if /$found_start_token/ ;
    $chunk .= $_ if $capture_mode ;
    if (/$found_stop_token/) {
        push @chunks, $chunk ;
        $chunk = '' ;
        $capture_mode = 0 ;
    }
}
say $chunks[0] ;
say '---' ;
say $chunks[1] ;
exit 0
A couple of notes:
The current line, $_, is appended on to $chunk if we're in capture mode.
Both regexes use the /x modifier. This allows adding whitespace to the regex for easier reading.
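To show where the extra control pays off, here is a minimal sketch of the same capture-mode loop wrapped in a sub and run against data that has stray lines outside the entries. The helper name extract_chunks, the anchored regexes and the sample data are my own assumptions for illustration.

```perl
use v5.14;
use warnings;

# Hypothetical helper: collect every block that starts at a line matching
# $start and ends at a line matching $stop. Lines outside a block are ignored.
sub extract_chunks {
    my ($fh, $start, $stop) = @_;
    my @chunks;
    my $chunk     = '';
    my $capturing = 0;
    while ( my $line = <$fh> ) {
        $capturing = 1 if $line =~ $start;
        next unless $capturing;         # not inside an entry, skip the line
        $chunk .= $line;
        if ( $line =~ $stop ) {         # end of entry: store it and reset
            push @chunks, $chunk;
            $chunk     = '';
            $capturing = 0;
        }
    }
    return @chunks;
}

# Made-up sample data with lines that belong to no entry.
my $data = <<'END';
some header line that is not part of any entry
#L_ENTRY
#LEX <,>
#SYNONYM <0>
a stray line between entries
#L_ENTRY
#LEX <.>
#SYNONYM <0>
END

open(my $fh, '<', \$data) or die "in-memory file: $!";
my @chunks = extract_chunks(
    $fh,
    qr/ ^ \s* \#L_ENTRY \b /x,              # the # must be escaped under /x
    qr/ ^ \s* \#SYNONYM \s+ <0> \s* $ /x,
);
close $fh;

say scalar(@chunks), " chunks";
print join("---\n", @chunks);
```

The header line and the stray line never reach a chunk, which is something the record-separator approach cannot filter out on its own.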