I just started learning Python, and I have a sorted list of protein sequences (59,000 sequences in total), some of which overlap. I have made a toy list here as an example:
This is not an exact match for your expected output. You state that the list is sorted, but it is not: EAEUDNBNUW comes after EOEUDNBNUWD. I also don't know why EOEUDNBNUWD is missing from your expected output, so I am not sure whether your expectations are stated correctly or whether I have misread your question.
(Ah, yes, I see: the notion of overlap throws a wrench into the sort and startswith approach.)
It would be nice for the OP to restate that particular aspect. I read @DSM's comment without really understanding his concern. Now I do.
li = sorted([i.strip() for i in """
ABCDE
ABCDEFG
ABCDEFGH
ABCDEFGHIJKLMNO
CEST
DBTSFDE
DBTSFDEO
EOEUDNBNUW
EOEUDNBNUWD
EAEUDNBNUW
FEOEUDNBNUW
FG
FGH""".splitlines() if i.strip()])
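def get_iter(li):
    # Assumes li is sorted, so a string sits directly before its extensions.
    prev = ""
    for i in li:
        # i does not extend prev, so prev is the longest of its run: emit it.
        if not i.startswith(prev):
            yield prev
        prev = i
    # Emit the last candidate.
    yield prev

for v in get_iter(li):
    print(v)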
output:
ABCDEFGHIJKLMNO
CEST
DBTSFDEO
EAEUDNBNUW
EOEUDNBNUWD
FEOEUDNBNUW
FGH
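
Note that FGH survives in the output above even though it occurs inside ABCDEFGH, because startswith only catches overlap at the beginning of a string. If "overlap" is meant as "one sequence appears anywhere inside another", which is only my guess at what the OP intends, a minimal sketch could look like the following. The name drop_contained is made up for this example, and the toy list is retyped from the code above.

```python
def drop_contained(seqs):
    """Keep only sequences that are not a substring of another sequence.

    Assumption: "overlap" means full containment anywhere in a longer
    string, not just a shared prefix. This is one reading of the question.
    """
    kept = []
    # Longest first: anything that could contain s has already been seen.
    # set() removes exact duplicates so a sequence cannot eliminate itself.
    for s in sorted(set(seqs), key=len, reverse=True):
        if not any(s in longer for longer in kept):
            kept.append(s)
    return sorted(kept)


toy = ["ABCDE", "ABCDEFG", "ABCDEFGH", "ABCDEFGHIJKLMNO", "CEST",
       "DBTSFDE", "DBTSFDEO", "EOEUDNBNUW", "EOEUDNBNUWD", "EAEUDNBNUW",
       "FEOEUDNBNUW", "FG", "FGH"]

for seq in drop_contained(toy):
    print(seq)
```

On the toy list this keeps the same sequences as the startswith version except FGH. It compares every sequence against all kept ones, so it is quadratic in the worst case; with 59,000 sequences that may or may not be acceptable depending on how many survive.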