Question
I need to check, from a bash script, whether the contents of one file appear inside another file, given a multiline pattern and an input file.
Return value:
I want an exit status like the one grep returns: 0 if any match was found, 1 if no match was found.
Pattern:
- multiline,
- order of lines is important (treated as a single block of lines),
- includes characters such as numbers, letters, ?, &, *, # etc.,
Explanation
Only the following examples should match:
pattern  file1  file2  file3  file4
222      111    111    222    222
333      222    222    333    333
         333    333    444
                444
The following shouldn't:
pattern file1 file2 file3 file4 file5 file6 file7
222 111 111 333 *222 111 111 222
333 *222 222 222 *333 222 222
333 333* 444 111 333
444 333 333
Here's my script:
#!/bin/bash
function writeToFile {
    if [ -w "$1" ] ; then
        echo "$2" >> "$1"
    else
        echo -e "$2" | sudo tee -a "$1" > /dev/null
    fi
}
function writeOnceToFile {
    pcregrep --color -M "$2" "$1"
    #echo $?
    if [ $? -eq 0 ]; then
        echo This file contains text that was added previously
    else
        writeToFile "$1" "$2"
    fi
}
file=file.txt
#1?1
#2?2
#3?3
#4?4
pattern=$(cat pattern.txt)
#2?2
#3?3
writeOnceToFile "$file" "$pattern"
I can run grep for each line of the pattern, but that fails with this example:
file.txt
#1?1
#2?2
#=== added line
#3?3
#4?4
pattern.txt
#2?2
#3?3
or even if you swap lines 2 and 3:
file=file.txt
#1?1
#3?3
#2?2
#4?4
returning 0 when it shouldn't.
How can I fix it? Note that I prefer to use programs that are installed by default (ideally without pcregrep). Maybe sed or awk can solve this problem?
Answer 1:
I have a working version using perl.
I thought I had it working with GNU awk, but I didn't. RS=empty string splits on blank lines. See the edit history for the broken awk version.
"How can I search for a multiline pattern in a file?" shows how to use pcregrep, but I can't see a way to get it to work when the pattern to search for may contain regex special characters. -F fixed-string mode doesn't usefully work with multi-line mode: it still treats the pattern as a set of lines to be matched separately (not as a multi-line fixed string to be matched). I see you were already using pcregrep in your attempt.
BTW, I think you have a bug in your code in the non-sudo case:
function writeToFile {
    if [ -w "$1" ] ; then
        "$2" >> "$1" # probably you mean echo "$2" >> "$1"
    else
        echo -e "$2" | sudo tee -a "$1" > /dev/null
    fi
}
Anyway, attempts at using line-based tools have met with failure, so it's time to pull out a more serious programming language that doesn't force the newline convention on us. Just read both files into variables, and use a non-regex search:
#!/usr/bin/perl -w
# multi_line_match.pl pattern_file target_file
# exit(0) if a match is found, else exit(1)
#use IO::File;
use File::Slurp;
my $pat = read_file($ARGV[0]);
my $target = read_file($ARGV[1]);
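# Match either at the very start of the target, or right after any newline,
# so the pattern always begins on a line boundary.
if ((substr($target, 0, length($pat)) eq $pat) or index($target, "\n".$pat) >= 0) {
    exit(0);
}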
exit(1);
See "What is the best way to slurp a file into a string in Perl?" to avoid the dependency on File::Slurp (which isn't part of the standard perl distro, or a default Ubuntu 15.04 system). I went for File::Slurp partly for readability of what the program is doing, for non-perl-geeks, compared to:
my $contents = do { local(@ARGV, $/) = $file; <> };
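For anyone who would rather not depend on perl at all, the same read-both-files-and-search idea can be sketched in plain shell. This is only a sketch under a few assumptions: the helper name contains_block_sh is made up for illustration, both files fit comfortably in memory, neither contains NUL bytes (shell variables cannot hold them), and an empty pattern is treated as "no match".

```shell
#!/bin/bash
# contains_block_sh PATTERN_FILE TARGET_FILE   (hypothetical helper name)
# Returns 0 if PATTERN_FILE occurs in TARGET_FILE as a block of whole lines,
# 1 otherwise. The body runs in a subshell, so its variables stay local.
contains_block_sh() (
    nl='
'
    # $(...) strips trailing newlines, so append a sentinel "x" and remove it.
    pat=$(cat "$1"; printf x); pat=${pat%x}
    hay=$(cat "$2"; printf x); hay=${hay%x}
    [ -n "$pat" ] || exit 1      # assumption: an empty pattern never matches
    # Make sure both end with a newline so the last line is compared whole.
    case $pat in *"$nl") ;; *) pat=$pat$nl ;; esac
    case $hay in ''|*"$nl") ;; *) hay=$hay$nl ;; esac
    # The leading newline on both sides anchors the match to a line start.
    # The quoted part of the case pattern is matched literally, so ?, *, &
    # and # in the pattern file are not special.
    case "$nl$hay" in
        *"$nl$pat"*) exit 0 ;;
        *)           exit 1 ;;
    esac
)
```

Usage would look like `contains_block_sh pattern.txt file.txt && echo found`. The anchoring is the same trick as the "\n".$pat test in the perl version.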
I was working on avoiding reading the full file into memory, with an idea from http://www.perlmonks.org/?node_id=98208. I think non-matching cases would usually still read the whole file at once. Also, the logic was pretty complex for handling a match at the front of the file, and I didn't want to spend a long time testing to make sure it was correct for all cases. Here's what I had before giving up:
#IO::File->input_record_separator($pat);
$/ = $pat; # pat must include a trailing newline if you want it to match one
my $fh = IO::File->new($ARGV[2], O_RDONLY)
    or die 'Could not open file ', $ARGV[2], ": $!";
$tail = substr($fh->getline, -1); #fast forward to the first match
#print each occurrence in the file
#print IO::File->input_record_separator while $fh->getline;
#FIXME: something clever here to handle the case where $pat matches at the beginning of the file.
do {
    # fixme: need to check defined($fh->getline)
    if (($tail eq '\n') or ($tail = substr($fh->getline, -1))) {
        exit(0); # if there's a 2nd line
    }
} while($tail);
exit(1);
$fh->close;
Another idea was to filter patterns and files to be searched through tr '\n' '\r' or something, so they would all be single lines. (\r being a likely safe choice that wouldn't collide with anything already in a file or a pattern.)
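A rough sketch of that tr idea, in case it helps. The helper name contains_block_tr is hypothetical, and the sketch assumes that neither file already contains \r and that both files end with a newline (otherwise the last pattern line could match a prefix of a longer line). Each input gets an extra leading newline before the translation, so the pattern is anchored to the start of a line.

```shell
#!/bin/bash
# contains_block_tr PATTERN_FILE TARGET_FILE   (hypothetical helper name)
# Returns grep's status: 0 if the pattern block is found, 1 if not.
contains_block_tr() (
    # Prefix a newline, then turn every newline into \r, giving one long line.
    pat=$( { printf '\n'; cat "$1"; } | tr '\n' '\r' )
    # An empty pattern file would reduce to a lone \r and match any target;
    # treating that as "no match" is an assumption of this sketch.
    [ "$pat" != "$(printf '\r')" ] || exit 1
    # -F: fixed string, so ?, *, & and # are literal; -q: status only.
    { printf '\n'; cat "$2"; } | tr '\n' '\r' | grep -qF -e "$pat"
)
```

Because the whole target becomes a single line, grep still has to hold it in memory, so this does not help with very large files.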
Answer 2:
I would just use diff for this task:
diff pattern <(grep -f file pattern)
Explanation
diff file1 file2 reports whether two files differ or not. By saying grep -f file pattern you are seeing what content of pattern is in file.
So what you are doing is checking which lines from pattern are in file and then comparing this to pattern itself. If they match, it means that pattern is a subset of file!
Tests
seq 10 is part of seq 20! Let's check it:
$ diff <(seq 10) <(grep -f <(seq 20) <(seq 10))
$
seq 10 is not exactly inside seq 2 20 (1 is not in the second one):
$ diff -q <(seq 10) <(grep -f <(seq 2 20) <(seq 10))
Files /dev/fd/63 and /dev/fd/62 differ
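If you want this as a reusable check that only returns a status, something like the sketch below might do. The function name lines_subset is made up for illustration. Two grep flags are added that the command above does not use: -F so that lines containing ?, * or & are taken literally rather than as regexes, and -x so that only whole-line matches count.

```shell
#!/bin/bash
# lines_subset PATTERN_FILE TARGET_FILE   (hypothetical helper name)
# Returns 0 if every line of PATTERN_FILE also appears as a whole line
# somewhere in TARGET_FILE, 1 otherwise.
lines_subset() (
    # grep prints the lines of PATTERN_FILE that exist in TARGET_FILE;
    # diff -q then compares that output (read from stdin via "-") with
    # PATTERN_FILE itself and only reports through its exit status.
    grep -Fx -f "$2" -- "$1" | diff -q "$1" - > /dev/null
)
```

Note that this is a subset test on individual lines. It does not look at line order or adjacency, so a target containing 333 followed by 222 still passes for a pattern of 222 followed by 333.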
Answer 3:
I went through the problem again and I think awk can handle this better:
awk 'FNR==NR {a[FNR]=$0; next}
     FNR==1 && NR>1 {for (i in a) len++}
     {for (i=last; i<=len; i++) {
         if (a[i]==$0)
            {last=i; next}
      } status=1}
     END {print status+0}' file pattern
The idea is:
- Read the whole of file into memory, in an array a[line_number] = line.
- Count the elements in the array.
- Loop through the file pattern and check if the current line occurs in file anywhere between where the cursor is and the end of file. If it matches, move the cursor to the position where it was found. If it does not, set the status to 1; that is, there is a line in pattern that did not occur in file after the previous match.
- Print the status, which will be 0 unless it was set to 1 at some point.
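The loop above checks order but accepts gaps between the matched lines, and it prints the status instead of exiting with it. Since the question asks for a single contiguous block and a grep-like exit status, here is a sketch of a stricter variant. It is a different algorithm (a straightforward sliding-window comparison), not the code above, and the function name contains_block_awk is made up for illustration. Note the argument order: pattern file first, then the file to search.

```shell
#!/bin/bash
# contains_block_awk PATTERN_FILE TARGET_FILE   (hypothetical helper name)
# Exit status 0 if the lines of PATTERN_FILE occur as one contiguous block
# in TARGET_FILE, 1 otherwise.
contains_block_awk() {
    awk '
        FNR == NR { pat[++n] = $0; next }   # first file: remember the pattern lines
        { line[++m] = $0 }                  # second file: remember the target lines
        END {
            if (n == 0) exit 1              # assumption: an empty pattern never matches
            for (start = 1; start + n - 1 <= m; start++) {
                # Appending "" forces a string comparison, so 1e2 and 100 differ.
                for (k = 1; k <= n; k++)
                    if ((line[start + k - 1] "") != (pat[k] "")) break
                if (k > n) exit 0           # all n pattern lines matched in a row
            }
            exit 1
        }
    ' "$1" "$2"
}
```

With the file.txt from the question that has "#=== added line" between #2?2 and #3?3, this returns 1, because the two pattern lines are no longer adjacent.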
Test
They do match:
$ tail f p
==> f <==
222
333
555
==> p <==
222
333
$ awk 'FNR==NR {a[FNR]=$0; next} FNR==1 && NR>1{for (i in a) len++} {for (i=last; i<=len; i++) {if (a[i]==$0) {last=i; next}} status=1} END {print status+0}' f p
0
They don't:
$ tail f p
==> f <==
333
222
555
==> p <==
222
333
$ awk 'FNR==NR {a[FNR]=$0; next} FNR==1 && NR>1{for (i in a) len++} {for (i=last; i<=len; i++) {if (a[i]==$0) {last=i; next}} status=1} END {print status+0}' f p
1
With seq:
$ awk 'FNR==NR {a[FNR]=$0; next} FNR==1 && NR>1{for (i in a) len++} {for (i=last; i<=len; i++) {if (a[i]==$0) {last=i; next}} status=1} END {print status+0}' <(seq 2 20) <(seq 10)
1
$ awk 'FNR==NR {a[FNR]=$0; next} FNR==1 && NR>1{for (i in a) len++} {for (i=last; i<=len; i++) {if (a[i]==$0) {last=i; next}} status=1} END {print status+0}' <(seq 20) <(seq 10)
0
Source: https://stackoverflow.com/questions/31540902/how-to-check-if-one-file-is-part-of-other