Deleting a line from a huge file in Perl

安稳与你 提交于 2019-11-27 09:33:51

A file is a sequence of bytes. We can replace (overwrite) some of them, but how would we remove them? Once a file is written its bytes cannot be 'pulled out' of the sequence or 'blanked' in any way. (The ones at the end of the file can be dismissed, by truncating the file as needed.)

The rest of the content has to move 'up', so that what follows the text to be removed overwrites it. We have to rewrite the rest of the file. In practice it is often far simpler to rewrite the whole file.

As a very basic example

use warnings 'all';
use strict;
use File::Copy qw(move);

my $file_in = '...';
my $file_out = '...';  # best use `File::Temp`

open my $fh_in,  '<', $file_in  or die "Can't open $file_in: $!";
open my $fh_out, '>', $file_out or die "Can't open $file_out: $!";

# Remove a line with $pattern
my $pattern = qr/this line goes/;

while (<$fh_in>) 
{
    print $fh_out $_  unless /$pattern/;
}
close $fh_in;
close $fh_out;

# Rename the new fie into the original one, thus replacing it
move ($file_out, $file_in) or die "Can't move $file_out to $file_in: $!";

This writes every line of input file into the output file, unless a line matches a given pattern. Then that file is renamed, replacing the original (what does not involve data copy). See this topic in perlfaq5.

Since we really use a temporary file I'd recommend the core module File::Temp for that.


This may be made more efficient, but far more complicated, by opening in update '+<' mode so to overwrite only a portion of the file. You iterate until the line with the pattern, record (tell) its position and the line length, then copy all remaining lines in memory. Then seek back to the position minus length of that line, and dump the copied rest of the file, overwriting the line and all that follows it.

Note that now the data for the rest of the file is copied twice, albeit one copy is in memory. Going to this trouble may make sense if the line to be removed is far down a very large file. If there are more lines to remove this gets messier.


Writing out a new file and copying it over the original changes the file's inode number. That may be a problem for some tools or procedures, and if it is you can instead update the original by either

  • Once the new file is written out, open it for reading and open the original for writing. This clobbers the original file. Then read from the new file and write to the original one, thus copying the content back to the same inode. Remove the new file when done.

  • Open the original file in read-write mode ('+<') to start with. Once the new file is written, seek to the beginning of the original (or to the place from which to overwrite) and write to it the content of the new file. Remember to also set the end-of-file if the new file is shorter,

    truncate $fh, tell($fh); 
    

after copying is done. This requires some care and the first way is probably generally safer.

If the file weren't huge the new "file" can be "written" in memory, as an array or a string.

Use sed command from Linux command line in Perl:

my $return = `sed -i '3d' text.txt`;

Where "3d" means delete the 3rd row.

It is useful to look at perlrun and see how perl itself modifies a file 'in-place.'

Given:

$ cat text.txt
This is fist line
This is second line
This is third line
This is fourth line
This is fifth line

You can apparently 'modify in-place', sed like, by using the -i and -p switch to invoke Perl:

$ perl -i -pe 's/This is third line\s*//' text.txt
$ cat text.txt
This is fist line
This is second line
This is fourth line
This is fifth line

But if you consult the Perl Cookbook recipe 7.9 (or look at perlrun) you will see that this:

$ perl -i -pe 's/This is third line\s*//' text.txt

is equivalent to:

while (<>) {
    if ($ARGV ne $oldargv) {           # are we at the next file?
        rename($ARGV, $ARGV . '.bak');
        open(ARGVOUT, ">$ARGV");       # plus error check
        select(ARGVOUT);
        $oldargv = $ARGV;
    }
    s/This is third line\s*//;
}
continue{
    print;
}
select (STDOUT);                      # restore default output
标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!