问题
I have two files:
candidates.csv:
id,value
1,123
4,1
2,5
50,5
blacklist.csv:
1
2
5
3
10
I'd like to remove all rows from candidates.csv in which the first column (id) has a value contained in blacklist.csv. id is always numeric. In this case I'd like my output to look like this:
id,value
4,1
50,5
So far, my script for identifying the duplicate lines looks like this:
cat candidates.csv | cut -d \, -f 1 | grep -f blacklist.csv -w
This gives me the output
1
2
Now I somehow need to pipe this information back into sed/awk/gawk/... to delete the duplicates, but I don't know how. Any ideas how I can continue from here? Or is there a better solution altogether? My only restriction is that it has to run in bash.
回答1:
What about the following:
awk -F, '(NR==FNR){a[$1];next}!($1 in a)' blacklist.csv candidates.csv
How does this work?
An awk program is a series of pattern-action pairs, written as:
condition { action }
condition { action }
...
where condition is typically an expression and action a series of commands. Here, the first condition-action pairs read:
(NR==FNR){a[$1];next}if the total record countNRequals the record count of the fileFNR(i.e. if we are reading the first file), store all values in arrayaand skip to the next record (do not do anything else)!($1 in a)if the first field is not in the arrayathen perform the default action which is print the line. This will only work on the second file as the condition of the first condition-action pair does not hold.
回答2:
If you are not too concerned about the order of the lines in your candidates.csv file you could use the following:
join -v 1 -t, <(sort -t, candidates.csv) <(sort blacklist.csv)
The -v 1 requests all the lines from the first file (the sorted candidates.csv) which do not match on the first field with the second file (the blacklist.csv). The -t, just sets the comma as a separator.
If you're concerned about the header line in the candidates.csv file you could strip it before the sorting or change the order.
回答3:
You can use sed and grep together to get the output
$ sed -e 's/[0-9]+/&\,/g' blacklist.csv > filter.csv
$ grep -Fvf filter.csv candidates.csv
id,value
4,1
50,5
sed command adds a , to each id and output to a filter.csv. E is to interprete regex in MacOSX/FreeBSD, same as -r in GNU sed.
grep uses the option f to compare between files, then remove the lines using v. F is for fixed string.
来源:https://stackoverflow.com/questions/52649404/how-to-delete-rows-from-a-csv-file-based-on-a-list-values-from-another-file