Why were pandas merges in python faster than data.table merges in R in 2012?

攒了一身酷 2020-12-22 14:50

I recently came across the pandas library for Python, which according to this benchmark performs very fast in-memory merges. It's even faster than the data.table package i

4 answers
  •  Happy的楠姐
    2020-12-22 15:22

    There are already great answers here, notably from the authors of both tools the question asks about. Matt's answer explains the case reported in the question: it was caused by a bug, not by the merge algorithm. The bug was fixed the next day, more than 7 years ago now.

    In my answer I will provide some up-to-date timings of the merge operation for data.table and pandas. Note that plyr and base R merge are not included.

    The timings I am presenting come from the db-benchmark project, a continuously run, reproducible benchmark. It upgrades the tools to recent versions and re-runs the benchmark scripts. It also covers many other software solutions; if you are interested in Spark, Dask and a few others, be sure to check the link.


    As of now... (still to be implemented: one more data size and 5 more questions)

    We test 2 different data sizes of the LHS table.
    For each of those data sizes we run 5 different merge questions:

    q1: LHS inner join RHS-small on integer
    q2: LHS inner join RHS-medium on integer
    q3: LHS outer join RHS-medium on integer
    q4: LHS inner join RHS-medium on factor (categorical)
    q5: LHS inner join RHS-big on integer

    The RHS table comes in 3 different sizes:

    • small translates to size of LHS/1e6
    • medium translates to size of LHS/1e3
    • big translates to size of LHS

    In all cases around 90% of rows match between LHS and RHS, and there are no duplicates in the RHS join column (so no Cartesian product).
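The setup above can be sketched on the pandas side roughly as follows. This is only a toy illustration, not the db-benchmark scripts: the helper names (`make_tables`, `timed_merge`, `run_questions`), the column names, the way the roughly 90% match rate is generated, and the much smaller table size are all assumptions made for the example. It also assumes that "outer" in q3 means a left outer join that keeps every LHS row; use `how="outer"` instead if a full outer join is intended.

```python
import time

import numpy as np
import pandas as pd


def make_tables(n_lhs, n_rhs, match_frac=0.9, seed=0):
    """Build toy LHS/RHS tables (hypothetical layout, not the benchmark data)."""
    rng = np.random.default_rng(seed)
    # RHS join keys are unique, so a join can never multiply rows.
    rhs = pd.DataFrame({"id": np.arange(n_rhs), "v_rhs": rng.random(n_rhs)})
    # About match_frac of the LHS keys exist in RHS; the rest fall outside its key range.
    n_match = round(n_lhs * match_frac)
    matching = rng.integers(0, n_rhs, size=n_match)
    missing = rng.integers(n_rhs, 2 * n_rhs, size=n_lhs - n_match)
    ids = np.concatenate([matching, missing])
    rng.shuffle(ids)
    lhs = pd.DataFrame({"id": ids, "v_lhs": rng.random(n_lhs)})
    return lhs, rhs


def timed_merge(lhs, rhs, how):
    """Run one merge on the 'id' column and return (result, elapsed seconds)."""
    start = time.perf_counter()
    out = lhs.merge(rhs, on="id", how=how)
    return out, time.perf_counter() - start


def run_questions(n_lhs=200_000):
    # RHS sizes relative to LHS, clamped to at least one row for small toy inputs.
    sizes = {
        "small": max(1, n_lhs // 10**6),
        "medium": max(1, n_lhs // 10**3),
        "big": n_lhs,
    }
    # (question, RHS size, join type, join on categorical key?)
    plan = [
        ("q1", "small", "inner", False),
        ("q2", "medium", "inner", False),
        ("q3", "medium", "left", False),  # assumption: "outer" = left outer
        ("q4", "medium", "inner", True),
        ("q5", "big", "inner", False),
    ]
    results = {}
    for name, size, how, as_categorical in plan:
        lhs, rhs = make_tables(n_lhs, sizes[size])
        if as_categorical:
            # Give both sides the same categorical dtype for the join key.
            cat = pd.CategoricalDtype(categories=np.arange(2 * sizes[size]))
            lhs = lhs.assign(id=lhs["id"].astype(cat))
            rhs = rhs.assign(id=rhs["id"].astype(cat))
        out, seconds = timed_merge(lhs, rhs, how)
        results[name] = (len(out), seconds)
    return results


results = run_questions()
for name, (rows, seconds) in results.items():
    print(f"{name}: {rows} rows in {seconds:.3f} s")
```

Because the RHS keys are unique, each inner join here returns exactly the matching share of LHS rows, and the left join returns exactly as many rows as LHS. Timings from a toy run like this are not comparable to the benchmark figures, which use far larger inputs.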


    As of now (run on 2nd Nov 2019)

    pandas 0.25.3 released on 1st Nov 2019
    data.table 1.12.7 (92abb70) released on 2nd Nov 2019

    The timings below are in seconds, for two different data sizes of LHS. The pd2dt column is an added field holding the ratio of pandas time to data.table time, i.e. how many times slower pandas is than data.table.

    • 0.5 GB LHS data
    +-----------+--------------+----------+--------+
    | question  |  data.table  |  pandas  |  pd2dt |
    +-----------+--------------+----------+--------+
    | q1        |        0.51  |    3.60  |      7 |
    | q2        |        0.50  |    7.37  |     14 |
    | q3        |        0.90  |    4.82  |      5 |
    | q4        |        0.47  |    5.86  |     12 |
    | q5        |        2.55  |   54.10  |     21 |
    +-----------+--------------+----------+--------+
    
    • 5 GB LHS data
    +-----------+--------------+----------+--------+
    | question  |  data.table  |  pandas  |  pd2dt |
    +-----------+--------------+----------+--------+
    | q1        |        6.32  |    89.0  |     14 |
    | q2        |        5.72  |   108.0  |     18 |
    | q3        |       11.00  |    56.9  |      5 |
    | q4        |        5.57  |    90.1  |     16 |
    | q5        |       30.70  |   731.0  |     23 |
    +-----------+--------------+----------+--------+
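As a quick check of the pd2dt column, the ratio can be recomputed from the timings in the two tables. The sketch below assumes the ratio is rounded down to a whole number; that is an inference that happens to reproduce every published value, not something stated in the answer. The function name `pd2dt` simply mirrors the column name.

```python
import math

# Timings in seconds copied from the two tables: (data.table, pandas).
timings = {
    "0.5 GB": {
        "q1": (0.51, 3.60),
        "q2": (0.50, 7.37),
        "q3": (0.90, 4.82),
        "q4": (0.47, 5.86),
        "q5": (2.55, 54.10),
    },
    "5 GB": {
        "q1": (6.32, 89.0),
        "q2": (5.72, 108.0),
        "q3": (11.00, 56.9),
        "q4": (5.57, 90.1),
        "q5": (30.70, 731.0),
    },
}


def pd2dt(dt_seconds, pandas_seconds):
    """How many times slower pandas is than data.table, rounded down (assumed)."""
    return math.floor(pandas_seconds / dt_seconds)


ratios = {
    size: {q: pd2dt(dt, pd_) for q, (dt, pd_) in questions.items()}
    for size, questions in timings.items()
}

for size, by_question in ratios.items():
    print(size, by_question)
```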
    
