Since Python's strings are immutable, I was wondering how to concatenate strings more efficiently.
I can write it like this:
s += string
As @jdi mentions, the Python documentation suggests using str.join or io.StringIO for string concatenation. It also says that a developer should expect quadratic time from += in a loop, even though there has been an optimisation since Python 2.4. As this answer says:
If Python detects that the left argument has no other references, it calls realloc to attempt to avoid a copy by resizing the string in place. This is not something you should ever rely on, because it's an implementation detail and because if realloc ends up needing to move the string frequently, performance degrades to O(n^2) anyway.
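For reference, the two recommended idioms can be sketched as follows. This is a minimal illustration only; the helper names build_with_join and build_with_stringio are hypothetical and do not come from the documentation.

```python
import io


def build_with_join(parts):
    # Collect the pieces first, then build the final string in one call.
    return ''.join(parts)


def build_with_stringio(parts):
    # Write the pieces into an in-memory text buffer and read it back once.
    buf = io.StringIO()
    for part in parts:
        buf.write(part)
    return buf.getvalue()


parts = ['spam', 'eggs', 'ham']
assert build_with_join(parts) == build_with_stringio(parts) == 'spameggsham'
```

Both variants avoid creating a new intermediate string on every step, which is what a += loop does when the in-place resize is not available.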
Here is an example of real-world code that naively relied on the += optimisation, although it didn't apply. The code below converts an iterable of short strings into bigger chunks to be used in a bulk API.
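def test_concat_chunk(seq, split_by):
    result = ['']
    for item in seq:
        # Start a new chunk once the current one would exceed split_by.
        if len(result[-1]) + len(item) > split_by:
            result.append('')
        # The list still holds a reference to result[-1] here, so (in CPython,
        # as far as I can tell) the in-place resize is not used and the whole
        # chunk is copied on every +=.
        result[-1] += item
    return result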
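To get a feel for the cost, here is a rough worst-case model. It assumes every += copies the whole chunk built so far, i.e. that the in-place resize never applies. The function name worst_case_copied_chars and the model itself are illustrative assumptions, not measurements; the sizes mirror the benchmark further down (513-character lines, a 10 MiB chunk).

```python
def worst_case_copied_chars(item_len, items_per_chunk):
    # In this model, appending the k-th item builds a new string of
    # k * item_len characters, so one chunk costs
    # item_len * (1 + 2 + ... + n) copied characters.
    n = items_per_chunk
    return item_len * n * (n + 1) // 2


item_len = 513                 # 512 characters plus a newline
chunk_size = 10 * 2 ** 20      # one 10 MiB chunk
n = chunk_size // item_len     # number of items that fit into one chunk

quadratic = worst_case_copied_chars(item_len, n)
linear = item_len * n          # each character copied once, as with a single join

print(f'{n} items per chunk')
print(f'+= worst case: {quadratic:,} characters copied')
print(f'single join:   {linear:,} characters copied')
print(f'ratio:         {quadratic / linear:.1f}x')
```

Under this model the ratio between the two is (n + 1) / 2, so it grows linearly with the number of items per chunk, which is what quadratic total cost means in practice.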
This code can literally run for hours because of its quadratic time complexity. Below are alternatives that use the suggested data structures:
import io


def test_stringio_chunk(seq, split_by):
    def chunk():
        buf = io.StringIO()
        size = 0
        for item in seq:
            if size + len(item) <= split_by:
                # write() returns the number of characters written.
                size += buf.write(item)
            else:
                # Emit the finished chunk and start a fresh buffer.
                yield buf.getvalue()
                buf = io.StringIO()
                size = buf.write(item)
        if size:
            yield buf.getvalue()

    return list(chunk())
def test_join_chunk(seq, split_by):
    def chunk():
        buf = []
        size = 0
        for item in seq:
            if size + len(item) <= split_by:
                buf.append(item)
                size += len(item)
            else:
                # Join the collected items once per chunk, then reuse the list.
                yield ''.join(buf)
                buf.clear()
                buf.append(item)
                size = len(item)
        if size:
            yield ''.join(buf)

    return list(chunk())
And a micro-benchmark:
import timeit
import random
import string
import matplotlib.pyplot as plt

line = ''.join(random.choices(
    string.ascii_uppercase + string.digits, k=512)) + '\n'
x = []
y_concat = []
y_stringio = []
y_join = []
n = 5
for i in range(1, 11):
    x.append(i)
    seq = [line] * (20 * 2 ** 20 // len(line))
    chunk_size = i * 2 ** 20
    y_concat.append(
        timeit.timeit(lambda: test_concat_chunk(seq, chunk_size), number=n) / n)
    y_stringio.append(
        timeit.timeit(lambda: test_stringio_chunk(seq, chunk_size), number=n) / n)
    y_join.append(
        timeit.timeit(lambda: test_join_chunk(seq, chunk_size), number=n) / n)

plt.plot(x, y_concat)
plt.plot(x, y_stringio)
plt.plot(x, y_join)
plt.legend(['concat', 'stringio', 'join'], loc='upper left')
plt.show()