Remove duplicates in a list of lists based on the third item in each sublist

流过昼夜 submitted on 2019-12-08 03:24:31

I'd do it like this:

seen = set()
cond = [x for x in c if x[3] not in seen and not seen.add(x[3])]

Explanation:

seen is a set that keeps track of the fourth elements already encountered across the sublists. cond is the condensed list. If x[3] (where x is a sublist in c) is not in seen, x is added to cond and x[3] is added to seen.

seen.add(x[3]) returns None, so not seen.add(x[3]) is always True. That part is only evaluated when x[3] not in seen is True, because Python uses short-circuit evaluation. Whenever the second condition is evaluated, it yields True and has the side effect of adding x[3] to seen. Here's another example of what's happening (print returns None and has the "side effect" of printing something):

>>> False and not print('hi')
False
>>> True and not print('hi')
hi
True
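To see the whole thing end to end, here is a minimal runnable sketch of the comprehension above. The sample rows in c are only illustrative; any list of sublists with at least four elements and a hashable fourth element should behave the same way.

```python
# Illustrative sample rows; the fourth element, x[3], is the deduplication key.
c = [['470', '4189.0', 'asdfgw', 'fds'],
     ['470', '4189.0', 'qwer', 'fds'],
     ['470', '4189.0', 'qwer', 'dsfs fdv']]

seen = set()
# Keep a row only the first time its fourth element appears.
# seen.add() returns None, so "not seen.add(...)" is True and records the key.
cond = [x for x in c if x[3] not in seen and not seen.add(x[3])]

print(cond)
```

The first row with a given fourth element wins, and the original order is preserved.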

You have a significant logic flaw in your current code:

for items in d:
    if bact[3] != items[3]:
        d.append(bact)  

This adds bact to d once for every item in d whose fourth element differs from bact's. For a minimal fix, you need to switch to:

for items in d:
    if bact[3] == items[3]:
        break
else:
    d.append(bact)  

This adds bact exactly once, and only if no item in d matches. I suspect this will also make your code run in a more sensible time.
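For context, here is a minimal sketch of the for/else check placed inside a complete loop. The outer loop over c and the sample rows are assumptions, since the surrounding code from the question is not shown here.

```python
# Illustrative sample rows (assumed, not taken from the question).
c = [['470', '4189.0', 'asdfgw', 'fds'],
     ['470', '4189.0', 'qwer', 'fds'],
     ['470', '4189.0', 'qwer', 'dsfs fdv']]

d = []
for bact in c:  # assumed outer loop over the input list
    for items in d:
        if bact[3] == items[3]:
            break  # a row with this fourth element is already in d
    else:
        # The else branch runs only when the inner loop finished without break,
        # i.e. no row in d shares bact's fourth element.
        d.append(bact)

print(d)
```

The else clause belongs to the inner for loop, not to the if. It also runs when d is still empty, which is what lets the first row in.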


On top of that, one obvious performance improvement (a speed boost, albeit at the cost of some memory) would be to keep a set of the fourth elements you've seen so far. Set lookups use hashes, so the membership test (marked with a comment in the code) takes roughly constant time on average instead of scanning all of d.

d = []
seen = set()
for bact in c:
    if bact[3] not in seen: # membership test
        seen.add(bact[3])
        d.append(bact)
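If the same pattern is needed for other positions, the loop can be wrapped in a small helper. This is only a sketch: the name unique_by_index and the sample rows are made up for illustration, and it assumes the element at the chosen index is hashable.

```python
def unique_by_index(rows, index):
    """Return the rows whose element at `index` has not appeared before, in order."""
    seen = set()
    result = []
    for row in rows:
        key = row[index]
        if key not in seen:  # average constant-time membership test on a set
            seen.add(key)
            result.append(row)
    return result


# Illustrative sample rows (assumed).
c = [['470', '4189.0', 'asdfgw', 'fds'],
     ['470', '4189.0', 'qwer', 'fds'],
     ['470', '4189.0', 'qwer', 'dsfs fdv']]

d = unique_by_index(c, 3)
print(d)
```

Passing a different index deduplicates on another position; for example, unique_by_index(c, 2) keys on the third element instead.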

Use pandas. I assume you have better column names as well.

c = [['470', '4189.0', 'asdfgw', 'fds'],
     ['470', '4189.0', 'qwer', 'fds'],
     ['470', '4189.0', 'qwer', 'dsfs fdv']]
import pandas as pd
df = pd.DataFrame(c, columns=['col_1', 'col_2', 'col_3', 'col_4'])
df.drop_duplicates('col_4', inplace=True)
print(df)

  col_1   col_2   col_3     col_4
0   470  4189.0  asdfgw       fds
2   470  4189.0    qwer  dsfs fdv