Python to remove duplicates using only some, not all, columns

Question

I have a tab-delimited input.txt file like this

A    B    C
A    B    D
E    F    G
E    F    T
E    F    K

These are tab-delimited.

I want to remove duplicates whenever multiple rows share the same 1st and 2nd columns, regardless of the other columns.

For example, the 1st and 2nd rows differ in the 3rd column but have the same 1st and 2nd columns, so I want to remove "A B D", the one that appears later.

So output.txt should look like this:

A    B    C
E    F    G

If I wanted to remove duplicates in the usual way, I would just pass the list to set() and be done.

But now I am trying to remove duplicates using only "some" columns.

Using Excel, it's easy:

Data -> Remove Duplicates -> Select columns

Using MATLAB, it's easy, too:

import input.txt -> Use "unique" function with respect to 1st and 2nd columns -> Remove the rows numbered "1"

But in Python I couldn't find how to do this, because the only way I knew to remove duplicates was using "set".

===========================

This is what I tried, following undefined_is_not_a_function's answer.

I am not sure how to write the result to output.txt, or how to alter the code so that I can specify which columns to use for duplicate removal (like 3 and 5).

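import sys

# Path of the tab-delimited input file, given on the command line
input_path = sys.argv[1]

seen = set()
data = []
with open(input_path) as f:
    for line in f.read().splitlines():
        # Key on the first two columns only
        key = tuple(line.split(None, 2)[:2])
        if key not in seen:
            data.append(line)
            seen.add(key)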

Answer 1:

You should use itertools.groupby for this. Here I am grouping the data on the first two columns and then using next() to get the first item from each group.

>>> from itertools import groupby
>>> s = '''A    B    C
A    B    D
E    F    G
E    F    T
E    F    K'''
>>> for k, g in groupby(s.splitlines(), key=lambda x: x.split()[:2]):
...     print(next(g))
...
A    B    C
E    F    G

Simply replace s.splitlines() with the file object if the input is coming from a file.

Note that the above solution works only if the data is already sorted by the first two columns, because groupby only groups consecutive items with equal keys. If that's not the case, you'll have to use a set instead.

>>> from operator import itemgetter
>>> ig = itemgetter(0, 1)  # Pass any column indices you want; indexing starts at 0
>>> s = '''A    B    C
A    B    D
E    F    G
E    F    T
E    F    K
A    B    F'''     
>>> seen = set()
>>> data = []
>>> for line in s.splitlines():
...     key = ig(line.split())
...     if key not in seen:
...         data.append(line)
...         seen.add(key)
...         
>>> data
['A    B    C', 'E    F    G']
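
To tie this back to the follow-up in the question (choosing the key columns and writing the result to output.txt), the set-based approach above could be wrapped roughly as in the sketch below. The function name dedupe_by_columns, its parameters, and the file names are illustrative assumptions, not part of the original answer. It also assumes every non-blank line has at least as many fields as the largest requested column index, and that the input and output paths are different files.

```python
from operator import itemgetter


def dedupe_by_columns(in_path, out_path, columns=(0, 1), sep="\t"):
    """Copy in_path to out_path, keeping only the first line for each key.

    `columns` holds the 0-based indices of the fields that form the key.
    Hypothetical helper: the name and parameters are assumptions.
    """
    pick = itemgetter(*columns)
    seen = set()
    # in_path and out_path must differ: opening a file with "w" truncates it.
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            fields = line.rstrip("\n").split(sep)
            if fields == [""]:
                continue  # skip blank lines
            key = pick(fields)  # a tuple, or a single string for one column
            if key not in seen:
                seen.add(key)
                dst.write(line)


# Example call (file names are placeholders):
# dedupe_by_columns("input.txt", "output.txt", columns=(0, 1))
# To key on the 3rd and 5th columns instead, pass columns=(2, 4).
```

Because the key is looked up in a set, the input does not need to be sorted, and the surviving lines keep their original order.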

Answer 2:

If you have access to a Unix system, sort is a nice utility that is made for this kind of problem.

sort -u -t$'\t' --key=1,2 filein.txt

I know this is a Python question, but sometimes Python is not the best tool for the task. And you can always embed a system call in your Python script.
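
Embedding that call from Python might look like the sketch below. The wrapper name is hypothetical, and it assumes a sort executable that accepts these options is on the PATH and that Python 3.7 or later is used. Note that the output comes back sorted by the key rather than in the original file order.

```python
import subprocess


def sort_unique_by_first_two_columns(path):
    """Run `sort -u` keyed on fields 1-2 and return the output lines.

    Hypothetical wrapper: assumes a `sort` executable supporting these
    options is available on PATH.
    """
    result = subprocess.run(
        ["sort", "-u", "-t", "\t", "--key=1,2", path],
        check=True,           # raise CalledProcessError if sort fails
        capture_output=True,  # collect stdout instead of printing it
        text=True,            # decode the output to str
    )
    return result.stdout.splitlines()


# Example call (file name is a placeholder):
# lines = sort_unique_by_first_two_columns("filein.txt")
```

Passing the arguments as a list avoids shell quoting issues with the tab separator.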

Answer 3:

You can do it with the code below.

dic = {}
with open('yourfile.txt') as file_:
    for each_line in file_:
        li = each_line.split()
        if len(li) < 3:
            continue  # skip blank or incomplete lines
        # keep only the first row seen for each (1st, 2nd) column pair
        if (li[0], li[1]) not in dic:
            dic[(li[0], li[1])] = li[2]

print(dic)

Sorry for the variable names.

Answer 4:

Assuming that you have already read your file and have an array named rows (tell me if you need help with that), the following code should work:

entries = []  # a list, so the original row order is preserved
keys = set()
for row in rows:
    key = (row[0], row[1])  # Only the first two columns

    if key not in keys:
        keys.add(key)
        entries.append(row)

Answer 5:

Please note that I am not an expert, but I have some ideas that may help you.

There is a csv module that is useful for delimited files; you might look there and see if you find something useful.
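
As a rough illustration of the csv route, the sketch below reads a tab-delimited file into a list of rows while keeping only the first row for each key, and writes rows back out. The function names and the columns parameter are hypothetical, and it assumes every non-blank row has the key columns.

```python
import csv


def read_unique_rows(path, columns=(0, 1)):
    """Return the rows of a tab-delimited file, first occurrence per key.

    Hypothetical helper: `columns` holds 0-based indices of the key fields.
    """
    seen = set()
    unique = []
    with open(path, newline="") as f:
        for row in csv.reader(f, delimiter="\t"):
            if not row:
                continue  # csv.reader yields an empty list for blank lines
            key = tuple(row[i] for i in columns)
            if key not in seen:
                seen.add(key)
                unique.append(row)
    return unique


def write_rows(path, rows):
    """Write rows (lists of strings) as tab-delimited text."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        writer.writerows(rows)


# Example calls (file names are placeholders):
# rows = read_unique_rows("input.txt")
# write_rows("output.txt", rows)
```

One caveat: with its default dialect, csv treats double quotes as field quoting, so data containing literal quote characters may need different reader settings.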

First, how are you storing the data? In a list?

Something like

[['A', 'B', 'C'],
 ['A', 'B', 'D'],
 ['E', 'F', 'G'], ...]

could be suitable (though maybe not the best choice).

Second, is it possible to go through the whole list?

You can simply take a line and compare it to all the other lines.

I would do this, supposing rows contains the letters:

rows = [['A', 'B', 'C'], ['A', 'B', 'D'], ['E', 'F', 'G']]  # example data
index_list = []
for i in range(len(rows)):
    for j in range(i + 1, len(rows)):  # compare only with later rows
        same_key = rows[i][0] == rows[j][0] and rows[i][1] == rows[j][1]
        if same_key and j not in index_list:
            index_list.append(j)
# pop from the end so the remaining indices stay valid
for j in sorted(index_list, reverse=True):
    rows.pop(j)

This is only meant to give you the idea. It is the simplest way to perform your task, and probably not the most suitable one: it will take a while on large inputs, since it performs a quadratic number of comparisons. Edit: pop, not remove.

Source: https://stackoverflow.com/questions/25035829/python-to-remove-duplicates-using-only-some-not-all-columns

Tags

python

file

duplicate-removal

tab-delimited-text