Remove duplicate lines but keep the one that does not have a string

Submitted by 别来无恙 on 2019-12-12 04:36:41

Question


I have been looking for a while for a way to remove duplicates from my CSV files. I started with a file with multiple fields, but then I realized that I could work with a file of just two fields and then merge the files on the first field. Here is what I want to do: in my CSV file, as you can see, there are genes with more than one description. Some of them have two descriptions: one is "hypothetical protein" and the other is something else. In that case I want to remove the line with "hypothetical protein" and keep the line with the other description. However, if there is more than one description, I can just keep the first one. I have been trying to do this with awk, and it would be great if I could use awk for it.

Input example:

AAEL018330  hypothetical protein
AAEL018330  tropomyosin, putative
AAEL018331  hypothetical protein
AAEL018332  
AAEL018333  hypothetical protein
AAEL018333  colmedin

Output wanted:

AAEL018330  tropomyosin, putative
AAEL018331  hypothetical protein
AAEL018332  
AAEL018333  colmedin

Thank you.


Answer 1:


In the general (unsorted) case, if you want to keep the last line seen for each value of the first field, you can use something like:

awk '{seen[$1]=$0} END {for (i in seen) {print seen[i]}}' file

That isn't guaranteed to keep the sort order, though, because for (i in seen) visits the array keys in an unspecified order.

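In this case, with sorted input, something like this should work. It prints the last line of each run of identical first fields, which gives the wanted output for the sample because the "hypothetical protein" line comes first for each gene: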

awk 'f!=$1 && line{print line} {f=$1; line=$0} END {print line}' file
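If the "hypothetical protein" line could also appear after the real description, keeping the last line is not enough. Here is a minimal sketch that checks the description explicitly and prints genes in first-seen order. It assumes the gene ID is the first field and that fields are separated by spaces or tabs (the exact separator is not clear from the question). The array names best, hyp and order are only illustrative, and the sample data is piped in with printf so that the example runs on its own:

```shell
# Sample data copied from the question. Fields are assumed to be separated
# by spaces or tabs, with the gene ID first.
result=$(printf '%s\n' \
    'AAEL018330  hypothetical protein' \
    'AAEL018330  tropomyosin, putative' \
    'AAEL018331  hypothetical protein' \
    'AAEL018332  ' \
    'AAEL018333  hypothetical protein' \
    'AAEL018333  colmedin' |
awk '
{
    id = $1
    desc = $0
    sub(/^[^ \t]+[ \t]*/, "", desc)    # description: the line minus the ID
    sub(/[ \t]+$/, "", desc)           # ignore trailing whitespace
    is_hyp = (desc == "hypothetical protein")

    if (!(id in best)) {
        order[++n] = id                # remember first-seen order of IDs
        best[id] = $0
        hyp[id] = is_hyp
    } else if (hyp[id] && !is_hyp && desc != "") {
        best[id] = $0                  # a real description replaces the placeholder
        hyp[id] = 0
    }
}
END { for (i = 1; i <= n; i++) print best[order[i]] }
')
printf '%s\n' "$result"
```

To run it on a real file, drop the printf pipeline and pass the file name as the last argument to awk, as in the commands above. The first line seen for a gene is kept unless it is the "hypothetical protein" placeholder and a later line has a non-empty description, so the input does not need to be sorted.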


Source: https://stackoverflow.com/questions/26598927/remove-duplicate-lines-but-keep-the-one-that-does-not-have-a-string
