Why were pandas merges in Python faster than data.table merges in R in 2012?

攒了一身酷 2020-12-22 14:50

I recently came across the pandas library for Python, which according to this benchmark performs very fast in-memory merges. It's even faster than the data.table package i

4 answers

独厮守ぢ (original poster)

2020-12-22 15:24

pandas is faster because I came up with a better algorithm. It is implemented very carefully using a fast hash table implementation (klib) and written in C/Cython to avoid Python interpreter overhead in the non-vectorizable parts. The algorithm is described in some detail in my presentation, A look inside pandas design and development.

The comparison with data.table is actually rather interesting, because the whole point of R's data.table is that it contains pre-computed indexes for various columns to accelerate operations like data selection and merges. In this case (database joins), pandas' DataFrame contains no pre-computed information that is used for the merge; it is, so to speak, a "cold" merge. If I had stored the factorized versions of the join keys, the join would be significantly faster, as factorizing is the biggest bottleneck for this algorithm.
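To make the "cold" versus pre-factorized distinction concrete, here is a minimal pure-Python sketch. It is not the pandas implementation (which is in C/Cython); the function names `factorize`, `join_on_codes`, and `cold_merge` are hypothetical, and a plain `dict` stands in for the hash table. It only illustrates the general idea of mapping join keys to integer codes first and then matching rows on those codes.

```python
def factorize(values, table):
    """Map each value to a small integer code, extending `table` as needed.

    `table` is an ordinary dict used as the hash table (an assumption made
    for illustration; it is not the hash table pandas uses).
    """
    codes = []
    for v in values:
        # len(table) is evaluated before insertion, so a new key
        # receives the next unused code.
        codes.append(table.setdefault(v, len(table)))
    return codes


def join_on_codes(left_codes, right_codes, n_codes):
    """Inner join on integer codes; returns (left_row, right_row) pairs."""
    # Group right-side row positions by code.
    buckets = [[] for _ in range(n_codes)]
    for pos, code in enumerate(right_codes):
        buckets[code].append(pos)

    # For each left row, emit one pair per matching right row.
    pairs = []
    for lpos, code in enumerate(left_codes):
        for rpos in buckets[code]:
            pairs.append((lpos, rpos))
    return pairs


def cold_merge(left_keys, right_keys):
    """'Cold' merge: factorize both key columns, then join on the codes."""
    table = {}  # shared, so equal keys get equal codes on both sides
    left_codes = factorize(left_keys, table)
    right_codes = factorize(right_keys, table)
    return join_on_codes(left_codes, right_codes, len(table))


if __name__ == "__main__":
    print(cold_merge(["a", "b", "c", "a"], ["b", "a", "a", "d"]))
```

In this sketch, a caller that had already stored the integer codes for both key columns (built against the same table) could call `join_on_codes` directly and skip the two `factorize` passes, which is the saving the answer describes.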

I should also add that the internal design of pandas' DataFrame is much more amenable to these kinds of operations than R's data.frame (which is just a list of arrays internally).
