I have a pandas.Series containing integers, but I need to convert these to strings for some downstream tools. Suppose I have a Series object like this:
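(A minimal illustrative example; the actual values don't matter, only that the dtype is integer.)

import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])   # integer dtype
# goal: the same values as Python str objects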
Performance
It's worth looking at actual performance before beginning any investigation since, contrary to popular opinion, list(map(str, x)) appears to be slower than x.apply(str).
import pandas as pd, numpy as np
### Versions: Pandas 0.20.3, Numpy 1.13.1, Python 3.6.2 ###
x = pd.Series(np.random.randint(0, 100, 100000))
%timeit x.apply(str)          # 42ms   (1)
%timeit x.map(str)            # 42ms   (2)
%timeit x.astype(str)         # 559ms  (3)
%timeit [str(i) for i in x]   # 566ms  (4)
%timeit list(map(str, x))     # 536ms  (5)
%timeit x.values.astype(str)  # 25ms   (6)
Points worth noting:

1. Option (6) is by far the fastest: numpy converts the whole underlying array without calling a Python function per element.
2. Options (3), (4) and (5) take roughly the same time, since each ends up calling str on every item from Python [option (5) is marginally faster because more of the loop runs in C, provided no lambda function is used].
3. Options (1) and (2) are an order of magnitude faster than (3)-(5); that surprising result is what the rest of this answer investigates.

Why is x.map / x.apply fast?
This appears to be because both use fast, compiled Cython code:
cpdef ndarray[object] astype_str(ndarray arr):
    cdef:
        Py_ssize_t i, n = arr.size
        ndarray[object] result = np.empty(n, dtype=object)

    for i in range(n):
        # we can use the unsafe version because we know `result` is mutable
        # since it was created from `np.empty`
        util.set_value_at_unsafe(result, i, str(arr[i]))

    return result
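As a rough pure-Python sketch of what that loop does (the name astype_str_sketch is mine, not pandas'):

import numpy as np

def astype_str_sketch(arr):
    # allocate an object array and fill it with str() of each element,
    # mirroring the Cython loop above (minus the compiled speed)
    n = arr.size
    result = np.empty(n, dtype=object)
    for i in range(n):
        result[i] = str(arr[i])
    return result

The real version is the same algorithm; the speedup comes entirely from the loop being compiled.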
Why is x.astype(str) slow?
Pandas applies str to each item in the series without going through the Cython fast path above.
Hence performance is comparable to that of [str(i) for i in x] and list(map(str, x)).
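You can check the equivalence claim directly (a quick sketch, using the same x as in the timings above):

import pandas as pd
import numpy as np

x = pd.Series(np.random.randint(0, 100, 100000))

# both routes produce Python str objects element-by-element
assert (x.astype(str) == pd.Series([str(i) for i in x], index=x.index)).all()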
Why is x.values.astype(str) so fast?
Numpy does not call a Python function on each element of the array; the conversion happens in compiled code. One description of this I found:
"If you did s.values.astype(str) what you get back is an object holding int. This is numpy doing the conversion, whereas pandas iterates over each item and calls str(item) on it. So if you do s.astype(str) you have an object holding str."
There is a technical reason why the numpy version hasn't been implemented for the no-nulls case.
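To see the distinction the quote is drawing, compare the dtypes (a small sketch; the exact '<U21' width depends on the integer dtype, so treat it as indicative):

import pandas as pd

s = pd.Series([1, 2, 3])

s.values.astype(str).dtype    # numpy fixed-width unicode, e.g. dtype('<U21') for int64
s.astype(str).dtype           # dtype('O'), i.e. an object array of Python str
type(s.astype(str).iloc[0])   # str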