Why were pandas merges in Python faster than data.table merges in R in 2012?

攒了一身酷 2020-12-22 14:50

I recently came across the pandas library for Python, which according to this benchmark performs very fast in-memory merges. It's even faster than the data.table package i

4 answers
  •  独厮守ぢ
    2020-12-22 15:24

    pandas is faster because I came up with a better algorithm, which is implemented very carefully using a fast hash table implementation (klib) and written in C/Cython to avoid Python interpreter overhead for the non-vectorizable parts. The algorithm is described in some detail in my presentation: A look inside pandas design and development.
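As a rough illustration of what hashing join keys into integer labels can look like, here is a minimal pure-Python sketch. It uses the built-in `dict` as a stand-in for the hash table, and the function name `factorize_keys` is hypothetical; it is not pandas' actual C/Cython code.

```python
def factorize_keys(values):
    """Map each distinct value to a small integer code, in order of first appearance."""
    table = {}    # hash table: key -> integer code
    codes = []    # one integer code per input value
    uniques = []  # distinct values, indexed by their code
    for value in values:
        code = table.get(value)
        if code is None:
            # First time this key is seen: assign the next free code.
            code = len(uniques)
            table[value] = code
            uniques.append(value)
        codes.append(code)
    return codes, uniques


codes, uniques = factorize_keys(["b", "a", "b", "c", "a"])
print(codes)    # [0, 1, 0, 2, 1]
print(uniques)  # ['b', 'a', 'c']
```

Once keys are replaced by dense integer codes, later steps can work on plain integer arrays instead of hashing arbitrary Python objects again.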

    The comparison with data.table is actually a bit interesting, because the whole point of R's data.table is that it contains pre-computed indexes for various columns to accelerate operations like data selection and merges. In this case (database joins), pandas' DataFrame contains no pre-computed information that is used for the merge; it is, so to speak, a "cold" merge. If I had stored the factorized versions of the join keys, the join would be significantly faster, as factorizing is the biggest bottleneck for this algorithm.
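To make the role of factorization concrete, the following sketch computes inner-join row positions from factorized keys using `pandas.factorize` and NumPy. The helper name `inner_join_indexers` is hypothetical and the grouping strategy is an assumption chosen for readability; it is not how pandas implements merges internally, and it assumes the keys contain no missing values.

```python
import numpy as np
import pandas as pd


def inner_join_indexers(left_keys, right_keys):
    """Return (left_idx, right_idx) row positions for an inner join on the keys."""
    n_left = len(left_keys)
    # "Cold" step: factorize both key columns together so that equal keys
    # share one integer code. This is the work a cached index would skip.
    codes, uniques = pd.factorize(np.concatenate([left_keys, right_keys]))
    left_codes, right_codes = codes[:n_left], codes[n_left:]

    # Lay out right-side row positions grouped by code.
    order = np.argsort(right_codes, kind="stable")
    counts = np.bincount(right_codes, minlength=len(uniques))
    starts = np.concatenate([[0], np.cumsum(counts)[:-1]])

    # Each left row is repeated once per matching right row.
    n_matches = counts[left_codes]
    left_idx = np.repeat(np.arange(n_left), n_matches)

    # Position of each output row within its group of matches.
    group_starts = np.repeat(np.cumsum(n_matches) - n_matches, n_matches)
    offsets = np.arange(n_matches.sum()) - group_starts
    right_idx = order[np.repeat(starts[left_codes], n_matches) + offsets]
    return left_idx, right_idx


left = np.array(["a", "b", "c", "a"])
right = np.array(["a", "a", "d", "c"])
left_idx, right_idx = inner_join_indexers(left, right)
print(left_idx.tolist())   # [0, 0, 2, 3, 3]
print(right_idx.tolist())  # [0, 1, 3, 0, 1]
```

Everything after the `pd.factorize` call operates only on integer arrays, which is why keeping the factorized keys around would remove most of the cost of a repeated join.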

    I should also add that the internal design of pandas' DataFrame is much more amenable to these kinds of operations than R's data.frame (which is just a list of arrays internally).
