I\'m trying to solve a Rosalind basic problem of counting nucleotides in a given sequence, and returning the results in a list. For those ones not familiar with bioinformati
The time difference here is pretty simple to explain. It all comes down to what runs within Python and what runs as native code. The latter will always be faster since it does not come with lots of evaluation overhead.
Now that’s already the reason why calling str.count() four times is faster than anything else. Although this iterates the string four times, these loops run in native code. str.count is implemented in C, so this has very little overhead, making this very fast. It’s really difficult to beat this, especially when the task is that simple (looking only for simple character equality).
Your second method, of collecting the counts in an array is actually a less performant version of the following:
def method4 (seq):
a, c, g, t = 0, 0, 0, 0
for i in seq:
if i == 'A':
a += 1
elif i == 'C':
c += 1
elif i == 'G':
g += 1
else:
t += 1
return [a, c, g, t]
Here, all four values are individual variables, so updating them is very fast. This is actually a bit faster than mutating list items.
The overall performance “problem” here is however that this iterates the string within Python. So this creates a string iterator and then produces every character individually as an actual string object. That’s a lot overhead and the main reason why every solution that works by iterating the string in Python will be slower.
The same problem is with collection.Counter. It’s implemented in Python so even though it’s very efficient and flexible, it suffers from the same issue that it’s just never near native in terms of speed.