Python string concatenation Idiom. Need Clarification.

末鹿安然 submitted on 2019-12-01 07:38:06

Yes. For the examples you chose, the importance isn't clear because you only have two very short strings, so the append would probably be faster.

But every time you do a + b with strings in Python, it causes a new allocation and then copies all the bytes from a and b into the new string. If you do this in a loop with lots of strings, these bytes have to be copied again and again, and each time the amount that has to be copied gets longer. This gives the quadratic behaviour.

On the other hand, creating a list of strings doesn't copy the contents of the strings - it just copies the references. This is incredibly fast, and runs in linear time. The join method then makes just one memory allocation and copies each string into the correct position only once. This also takes only linear time.
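A minimal sketch of the list-then-join pattern just described. The function name `build_text` and the sample strings are hypothetical, chosen only for illustration:

```python
def build_text(pieces):
    """Accumulate pieces in a list, then join them once at the end."""
    parts = []
    for piece in pieces:
        # Appending stores a reference to the string; its characters are not copied here.
        parts.append(piece)
    # The result is built in one step; each piece is copied into place once.
    return "".join(parts)


print(build_text(["spam", "eggs", "ham"]))
```

If the strings are already in a list or other iterable, the loop is unnecessary and `"".join(pieces)` does the same job directly.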

So yes, do use the ''.join idiom if you are potentially dealing with a large number of strings. For just two strings it doesn't matter.

If you need more convincing, try it for yourself by creating a string of 10M characters:

>>> chars = ['a'] * 10000000
>>> r = ''
>>> for c in chars: r += c
>>> print(len(r))

Compared with:

>>> chars = ['a'] * 10000000
>>> r = ''.join(chars)
>>> print(len(r))

The first method takes about 10 seconds. The second takes under 1 second.

Repeated concatenation is quadratic because it's Schlemiel the Painter's Algorithm (beware that some implementations will optimize this away so that it is not quadratic). join avoids this because it takes the entire list of strings, allocates the necessary space and does the concatenation in one pass.

Alex Martelli

When you code s1 + s2, Python needs to allocate a new string object, copy all characters of s1 into it, then after that all characters of s2. This trivial operation does not bear quadratic time costs: the cost is O(len(s1) + len(s2)) (plus a constant for allocation, but that doesn't figure in big-O;-).

However, consider the code in the quote you're giving: for s in strings: result += s.

Here, every time a new s is added, all the previous ones have to be first copied into the newly allocated space for result (strings are immutable, so the new allocation and copy must take place). Suppose you have N strings of length L: you'll copy L characters the first time, then 2 * L the second time, then 3 * L the third time... in all, that makes it L * N * (N+1) / 2 characters getting copied... so, yep, it's quadratic in N.
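The count above can be checked with a small model. This is only a sketch under the stated assumption that every `+=` copies the whole previous result plus the new piece (no implementation-level optimization); the function names are made up for illustration:

```python
def chars_copied_by_repeated_concat(n, length):
    """Count characters copied when appending n strings of the given length,
    assuming each += copies the old result plus the new piece."""
    copied = 0
    result_len = 0
    for _ in range(n):
        result_len += length   # size of the newly allocated result
        copied += result_len   # everything is copied into the new allocation
    return copied


def chars_copied_by_join(n, length):
    """Under the same model, join copies each character exactly once."""
    return n * length


for n in (10, 100, 1000):
    # The first count matches length * n * (n + 1) / 2; the second grows linearly.
    print(n, chars_copied_by_repeated_concat(n, 5), chars_copied_by_join(n, 5))
```

Multiplying N by 10 multiplies the first count by roughly 100 but the second by only 10.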

In some other cases, a quadratic algorithm may be faster than a linear one for small-enough values of N (because the multipliers and constant fixed costs may be much smaller); but that's not the case here, because allocations are costly (both directly, and indirectly because of the likelihood of fragmenting memory). In comparison, the overhead of accumulating the strings into a list is essentially negligible.

Joel writes about this in Back to Basics.

It's not obvious if you're referring to the same thing as other people. This optimization is important when you have many strings, say M of length N. Then

A

x = ''.join(strings) # Takes M*N operations 

B

x = ''
for s in strings:
    x = x + s  # Takes N + 2*N + ... + M*N operations

Unless optimized away by the implementation, yes: A is linear in the total length T = M*N, while B costs N*M*(M+1)/2, which is roughly T*T / (2*N). That is worse whenever M > 1, and roughly quadratic if M >> N.

Now, why join is actually quite intuitive: when you say "I have some strings", this can typically be formalized by saying that you have an iterable that yields strings. And that is exactly what you pass to "string".join().
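To illustrate that point, here is a small sketch in which the strings come from a generator rather than a list. The generator `squares_as_text` is a hypothetical example, not something from the question:

```python
def squares_as_text(limit):
    """Yield the decimal text of each square below limit, one string at a time."""
    n = 0
    while n * n < limit:
        yield str(n * n)
        n += 1


# join consumes any iterable of strings; the separator is the string it is called on.
print(",".join(squares_as_text(30)))
```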
