Question
I have been looking for a while for a way to remove duplicates from my CSV files. I started with a file with multiple fields, but then I realized I could work with a file of just two fields and later merge the files on the first field. Here is what I want to do: in the CSV file below, some genes have more than one description. When a gene has two descriptions, one being "hypothetical protein" and the other "something else", I want to remove the "hypothetical protein" line and keep the "something else" line. If there are more than two descriptions, keeping the first one is fine. I have been trying to do this with awk, and it would be great if I could use awk for it.
Input example:
AAEL018330 hypothetical protein
AAEL018330 tropomyosin, putative
AAEL018331 hypothetical protein
AAEL018332
AAEL018333 hypothetical protein
AAEL018333 colmedin
Output wanted:
AAEL018330 tropomyosin, putative
AAEL018331 hypothetical protein
AAEL018332
AAEL018333 colmedin
Thank you.
Answer 1:
In the general (unsorted) case, if you want to keep the last line seen for each value of the first field, you can use something like:
awk '{seen[$1]=$0} END {for (i in seen) {print seen[i]}}' file
Though that isn't guaranteed to preserve the input order.
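A quick check of that one-liner against the sample input from the question (the file name `genes.txt` is just for illustration):

```shell
# Write the sample input from the question to a file.
cat > genes.txt <<'EOF'
AAEL018330 hypothetical protein
AAEL018330 tropomyosin, putative
AAEL018331 hypothetical protein
AAEL018332
AAEL018333 hypothetical protein
AAEL018333 colmedin
EOF

# Keep the last line seen for each value of field 1. The for-in loop
# iterates the array in awk's internal (unspecified) order, so the
# output lines may appear in any order.
awk '{seen[$1]=$0} END {for (i in seen) {print seen[i]}}' genes.txt
```

Because later lines overwrite earlier ones in `seen[$1]`, this happens to give the wanted result here (the "hypothetical protein" line comes first for each gene), but the lines may come out in any order.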
In this case, with sorted input, something like this should work:
awk 'f!=$1 && line{print line} {f=$1; line=$0} END {print line}' file
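This version prints the last line of each group of equal first fields, which matches the wanted output because the "hypothetical protein" line appears first in each group. Running it on the sample input (the file name `genes_sorted.txt` is illustrative):

```shell
# Sample input, already sorted by the first field.
cat > genes_sorted.txt <<'EOF'
AAEL018330 hypothetical protein
AAEL018330 tropomyosin, putative
AAEL018331 hypothetical protein
AAEL018332
AAEL018333 hypothetical protein
AAEL018333 colmedin
EOF

# When field 1 changes, print the last saved line of the previous
# group; the END block flushes the final group.
awk 'f!=$1 && line{print line} {f=$1; line=$0} END {print line}' genes_sorted.txt
# Output:
# AAEL018330 tropomyosin, putative
# AAEL018331 hypothetical protein
# AAEL018332
# AAEL018333 colmedin
```

Unlike the hash-based version, this one preserves the input order of the groups, since it only ever looks at the current and previous line.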
Source: https://stackoverflow.com/questions/26598927/remove-duplicate-lines-but-keep-the-one-that-does-not-have-a-string