Cythonising Pandas: ctypes for content, index and columns

て烟熏妆下的殇ゞ 提交于 2021-02-06 11:59:19

问题


I am very new to Cython, yet am already experiencing extraordinary speedups just copying my .py to .pyx (and cimport cython, numpy etc) and importing into ipython3 with pyximport. Many tutorials start in this approach with the next step being to add cdef declarations for every data type, which I can do for the iterators in my for loops etc. But unlike most Pandas Cython tutorials or examples I am not apply functions so to speak, more manipulating data using slices, sums and division (etc).

So the question is: Can I increase the speed at which my code runs by stating that my DataFrame only contains floats (double), with columns that are int and rows that are int?

How to define the type of an embedded list? i.e [[int,int],[int]]

Here is an example that generates the AIC score for a partitioning of a DF, sorry it is so verbose:

    cimport cython
    import numpy as np
    cimport numpy as np
    import pandas as pd

    offcat = [
        "breakingPeace", 
        "damage", 
        "deception", 
        "kill", 
        "miscellaneous", 
        "royalOffences", 
        "sexual", 
        "theft", 
        "violentTheft"
        ]

    def partitionAIC(EmpFrame, part, OffenceEstimateFrame, ReturnDeathEstimate=False):
        """EmpFrame is DataFrame of ints, part is nested list of ints, OffenceEstimate frame is DF of float"""
        """partOf/block is a list of ints"""
        """ll, AIC,  is series/frame of floats"""
        ##Cython cdefs
        cdef int DFlen
        cdef int puns
        cdef int DeathPun    
        cdef int k
        cdef int pId
        cdef int punish

        DFlen = EmpFrame.shape[1]
        puns = 2
        DeathPun = 0
        PartitionModel = pd.DataFrame(index = EmpFrame.index, columns = EmpFrame.columns)

        for partOf in part:
            Grouping = [puns*x + y for x in partOf for y in list(range(0,puns))]
            PartGroupSum = EmpFrame.iloc[:,Grouping].sum(axis=1)

            for punish in range(0,puns):
                PunishGroup = [x*puns+punish for x in partOf]
                punishPunishment = ((EmpFrame.iloc[:,PunishGroup].sum(axis = 1) + 1/puns).div(PartGroupSum+1)).values[np.newaxis].T
                PartitionModel.iloc[:,PunishGroup] = punishPunishment
        PartitionModel = PartitionModel*OffenceEstimateFrame

        if ReturnDeathEstimate:
            DeathProbFrame = pd.DataFrame([[part]], index=EmpFrame.index, columns=['Partition'])
            for pId,block in enumerate(part):
                DeathProbFrame[pId] = PartitionModel.iloc[:,block[::puns]].sum(axis=1)
            DeathProbFrame = DeathProbFrame.apply(lambda row: sorted( [ [format("%6.5f"%row[idx])]+[offcat[X] for X in  x ] 
                for idx,x in enumerate(row['Partition'])],
                key=lambda x: x[0], reverse=True),axis=1)
        ll = (EmpFrame*np.log(PartitionModel.convert_objects(convert_numeric=True))).sum(axis=1)
        k = (len(part))*(puns-1)
        AIC = 2*k-2*ll

        if ReturnDeathEstimate:
            return AIC, DeathProbFrame
        else:
            return AIC

回答1:


My advice is to do as much as possible in pandas. This is kinda standard advice "get it working first, then care about performance if it really matters". So let's suppose you've done that (hopefully you've written some tests too), and it's too slow:

Profile your code. (See this SO answer, or use %prun in ipython).

The output of prun should drive what bit to improve next.

  1. pandas (make your code more pandorable, this can help a lot).
  2. numpy (not creating intermediary Series/DataFrames, being careful about dtypes)
  3. cython (the last resort).

Now, if it is a line to do with slicing (it probably isn't) put that tiny part in cython, I like to remove single python function calls to cython function. On that point stuff with cython should use numpy not pandas, I don't think pandas is not going to lower to C (cython can't infer types).


Putting your entire code into cython won't actually help that much, you want to only put the specific lines, or function calls, which are performance sensitive. Keeping cython focussed is the only way to have a good time.

Read the enhancing performance section of the pandas docs*! Here this process (prun -> cythonize -> type) is gone over step-by-step with a real-life example.

*Full-disclose I wrote it that section of the docs! :)



来源:https://stackoverflow.com/questions/29871723/cythonising-pandas-ctypes-for-content-index-and-columns

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!