I have a pandas.Series containing integers, but I need to convert these to strings for some downstream tools. Suppose I have a Series object like this:
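(A minimal illustrative example; the actual values don't matter, only that the dtype is integer.)

import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])   # integer dtype
# goal: the same values as Python str objects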
Performance
It's worth looking at actual performance before beginning any investigation since, contrary to popular opinion, list(map(str, x)) appears to be slower than x.apply(str).
import pandas as pd, numpy as np
### Versions: Pandas 0.20.3, Numpy 1.13.1, Python 3.6.2 ###
x = pd.Series(np.random.randint(0, 100, 100000))
%timeit x.apply(str)          # 42ms   (1)
%timeit x.map(str)            # 42ms   (2)
%timeit x.astype(str)         # 559ms  (3)
%timeit [str(i) for i in x]   # 566ms  (4)
%timeit list(map(str, x))     # 536ms  (5)
%timeit x.values.astype(str)  # 25ms   (6)
Points worth noting:

1. Option (6) is by far the fastest: numpy converts the whole underlying array without calling a Python function per element.
2. Options (3), (4) and (5) take roughly the same time, since each ends up calling str on every item from Python [option (5) is marginally faster because more of the loop runs in C, provided no lambda function is used].
3. Options (1) and (2) are an order of magnitude faster than (3)-(5); that surprising result is what the rest of this answer investigates.

Why is x.map / x.apply fast?
This appears to be because both use fast, compiled Cython code:
cpdef ndarray[object] astype_str(ndarray arr):
    cdef:
        Py_ssize_t i, n = arr.size
        ndarray[object] result = np.empty(n, dtype=object)

    for i in range(n):
        # we can use the unsafe version because we know `result` is mutable
        # since it was created from `np.empty`
        util.set_value_at_unsafe(result, i, str(arr[i]))

    return result
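As a rough pure-Python sketch of what that loop does (the name astype_str_sketch is mine, not pandas'):

import numpy as np

def astype_str_sketch(arr):
    # allocate an object array and fill it with str() of each element,
    # mirroring the Cython loop above (minus the compiled speed)
    n = arr.size
    result = np.empty(n, dtype=object)
    for i in range(n):
        result[i] = str(arr[i])
    return result

The real version is the same algorithm; the speedup comes entirely from the loop being compiled.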
Why is x.astype(str) slow?
Pandas applies str to each item in the series without going through the Cython fast path above.
Hence performance is comparable to that of [str(i) for i in x] and list(map(str, x)).
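You can check the equivalence claim directly (a quick sketch, using the same x as in the timings above):

import pandas as pd
import numpy as np

x = pd.Series(np.random.randint(0, 100, 100000))

# both routes produce Python str objects element-by-element
assert (x.astype(str) == pd.Series([str(i) for i in x], index=x.index)).all()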
Why is x.values.astype(str) so fast?
Numpy does not call a Python function on each element of the array; the conversion happens in compiled code. One description of this I found:
"If you did s.values.astype(str) what you get back is an object holding int. This is numpy doing the conversion, whereas pandas iterates over each item and calls str(item) on it. So if you do s.astype(str) you have an object holding str."
There is a technical reason why the numpy version hasn't been implemented for the no-nulls case.
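To see the distinction the quote is drawing, compare the dtypes (a small sketch; the exact '<U21' width depends on the integer dtype, so treat it as indicative):

import pandas as pd

s = pd.Series([1, 2, 3])

s.values.astype(str).dtype    # numpy fixed-width unicode, e.g. dtype('<U21') for int64
s.astype(str).dtype           # dtype('O'), i.e. an object array of Python str
type(s.astype(str).iloc[0])   # str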