“Piping” output from one function to another using Python infix syntax

-上瘾入骨i 2020-12-14 02:17

I'm trying to replicate, roughly, the dplyr package from R using Python/Pandas (as a learning exercise). Something I'm stuck on is the "piping" functionality.

In

5 answers
  • 2020-12-14 02:41

    While I can't help mentioning that using dplyr from Python might be the closest thing to having dplyr in Python (it has the rshift operator, but as a gimmick), I'd also like to point out that the pipe operator might only be necessary in R because R uses generic functions rather than methods as object attributes. Method chaining gives you essentially the same thing without having to override operators:

    dataf = (DataFrame(mtcars).
             filter('gear>=3').
             mutate(powertoweight='hp*36/wt').
             group_by('gear').
             summarize(mean_ptw='mean(powertoweight)'))
    

    Note that wrapping the chain in a pair of parentheses lets you break it across multiple lines without needing a trailing \ on each line.
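
    As a rough illustration of why method chaining needs no operator overloading, here is a minimal sketch in plain Python. The `Table` class and its `filter`/`mutate` methods are hypothetical stand-ins invented for this example, not the API used in the snippet above; each method returns a new object, which is what makes the calls chainable:

```python
class Table:
    """Hypothetical minimal table: a list of row dicts with chainable methods."""

    def __init__(self, rows):
        self.rows = list(rows)

    def filter(self, predicate):
        # Keep only the rows for which predicate(row) is true.
        return Table(row for row in self.rows if predicate(row))

    def mutate(self, **columns):
        # Add one computed column per keyword; each value is a function of the row.
        return Table(
            {**row, **{name: fn(row) for name, fn in columns.items()}}
            for row in self.rows
        )


cars = Table([
    {"gear": 4, "hp": 110, "wt": 2.62},
    {"gear": 3, "hp": 175, "wt": 3.44},
    {"gear": 2, "hp": 90, "wt": 2.0},
])

# Parentheses allow the chain to span several lines without backslashes.
result = (
    cars
    .filter(lambda row: row["gear"] >= 3)
    .mutate(powertoweight=lambda row: row["hp"] * 36 / row["wt"])
)
```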

  • 2020-12-14 02:41

    I would argue strongly against doing this, or any of the other answers suggested here, and would instead just implement a pipe function in standard Python code, without operator trickery, decorators or whatnot:

    def pipe(first, *args):
      for fn in args:
        first = fn(first)
      return first
    

    See my answer here for more background: https://stackoverflow.com/a/60621554/2768350

    Overloading operators, pulling in external libraries and whatnot only makes the code less readable, less maintainable, less testable and less Pythonic. If I want to do some kind of pipe in Python, I would not want to do more than pipe(input, fn1, fn2, fn3). That's the most readable and robust solution I can think of. If someone in our company committed operator overloading or new dependencies to production just to do a pipe, it would get reverted immediately and they would be sentenced to doing QA checks for the rest of the week :D If you really, really must use some sort of operator for piping, then maybe you have bigger problems and Python is not the right language for your use case...
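
    A short usage sketch for the `pipe` function above. The helper names (`add`, `double`) are made up for illustration; `functools.partial` is one way to supply extra arguments to a step, since `pipe` itself only passes the running value along:

```python
from functools import partial


def pipe(first, *args):
    # Same function as above: feed the value through each callable in turn.
    for fn in args:
        first = fn(first)
    return first


def add(x, amount):
    return x + amount


def double(x):
    return x * 2


# 3 -> add 1 -> double -> str
# partial binds `amount` by keyword, so the piped value fills the first parameter.
result = pipe(3, partial(add, amount=1), double, str)
```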

  • 2020-12-14 02:44

    I couldn't find a built-in way of doing this, so I created a class that uses the __call__ operator because it supports *args/**kwargs:

    class Pipe:
        def __init__(self, value):
            """
            Creates a new pipe with a given value.
            """
            self.value = value
        def __call__(self, func, *args, **kwargs):
            """
            Creates a new pipe with the value returned from `func` called with
            the current value, `args` and `kwargs`. Because every call returns
            a new Pipe, it is easy to save your intermediate results.
            """
            value = func(self.value, *args, **kwargs)
            return Pipe(value)
    

    The syntax takes some getting used to, but it allows for piping.

    def get(dictionary, key):
        assert isinstance(dictionary, dict)
        assert isinstance(key, str)
        return dictionary.get(key)
    
    def keys(dictionary):
        assert isinstance(dictionary, dict)
        return dictionary.keys()
    
    def filter_by(iterable, check):
        assert hasattr(iterable, '__iter__')
        assert callable(check)
        return [item for item in iterable if check(item)]
    
    def update(dictionary, **kwargs):
        assert isinstance(dictionary, dict)
        dictionary.update(kwargs)
        return dictionary
    
    
    x = Pipe({'a': 3, 'b': 4})(update, a=5, c=7, d=8, e=1)
    y = (x
        (keys)
        (filter_by, lambda key: key in ('a', 'c', 'e', 'g'))
        (set)
        ).value
    z = x(lambda dictionary: dictionary['a']).value
    
    assert x.value == {'a': 5, 'b': 4, 'c': 7, 'd': 8, 'e': 1}
    assert y == {'a', 'c', 'e'}
    assert z == 5
    
  • 2020-12-14 02:47

    It is hard to implement this using the bitwise or operator because pandas.DataFrame implements it. If you don't mind replacing | with >>, you can try this:

    import pandas as pd
    
    def select(df, *args):
        cols = [x for x in args]
        return df[cols]
    
    
    def rename(df, **kwargs):
        for name, value in kwargs.items():
            df = df.rename(columns={'%s' % name: '%s' % value})
        return df
    
    
    class SinkInto(object):
        def __init__(self, function, *args, **kwargs):
            self.args = args
            self.kwargs = kwargs
            self.function = function
    
        def __rrshift__(self, other):
            return self.function(other, *self.args, **self.kwargs)
    
        def __repr__(self):
            return "<SinkInto {} args={} kwargs={}>".format(
                self.function, 
                self.args, 
                self.kwargs
            )
    
    df = pd.DataFrame({'one' : [1., 2., 3., 4., 4.],
                       'two' : [4., 3., 2., 1., 3.]})
    

    Then you can do:

    >>> df
       one  two
    0    1    4
    1    2    3
    2    3    2
    3    4    1
    4    4    3
    
    >>> df = df >> SinkInto(select, 'one') \
                >> SinkInto(rename, one='new_one')
    >>> df
       new_one
    0        1
    1        2
    2        3
    3        4
    4        4
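
    The reason this works is Python's reflected-operator rule: for `a >> b`, Python first tries `type(a).__rshift__`, and if that is missing or returns `NotImplemented`, it falls back to `type(b).__rrshift__`. The following minimal sketch shows the same mechanism with a plain list instead of a DataFrame (the `take` helper is hypothetical):

```python
class SinkInto:
    def __init__(self, function, *args, **kwargs):
        self.function = function
        self.args = args
        self.kwargs = kwargs

    def __rrshift__(self, other):
        # Called for `other >> self` when `other` cannot handle `>>` itself.
        return self.function(other, *self.args, **self.kwargs)


def take(items, n):
    # Hypothetical helper: first n items of a sequence.
    return items[:n]


# list defines no __rshift__, so Python falls back to SinkInto.__rrshift__.
result = [5, 3, 8, 1] >> SinkInto(sorted) >> SinkInto(take, 2)
```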
    

    In Python 3 you can abuse Unicode identifiers:

    >>> print('\u01c1')
    ǁ
    >>> ǁ = SinkInto
    >>> df >> ǁ(select, 'one') >> ǁ(rename, one='new_one')
       new_one
    0        1
    1        2
    2        3
    3        4
    4        4
    

    [update]

    Thanks for your response. Would it be possible to make a separate class (like SinkInto) for each function to avoid having to pass the functions as an argument?

    How about a decorator?

    def pipe(original):
        class PipeInto(object):
            # Shared by all instances; only holds the wrapped function.
            data = {'function': original}
    
            def __init__(self, *args, **kwargs):
                # Keep the arguments on the instance so that separately
                # created stages do not overwrite each other's arguments.
                self.args = args
                self.kwargs = kwargs
    
            def __rrshift__(self, other):
                return self.data['function'](
                    other, 
                    *self.args, 
                    **self.kwargs
                )
    
        return PipeInto
    
    
    @pipe
    def select(df, *args):
        cols = [x for x in args]
        return df[cols]
    
    
    @pipe
    def rename(df, **kwargs):
        for name, value in kwargs.items():
            df = df.rename(columns={'%s' % name: '%s' % value})
        return df
    

    Now you can decorate any function that takes a DataFrame as the first argument:

    >>> df >> select('one') >> rename(one='first')
       first
    0      1
    1      2
    2      3
    3      4
    4      4
    

    Python is awesome!

    I know that languages like Ruby are "so expressive" that they encourage people to write every program as a new DSL, but this is kind of frowned upon in Python. Many Pythonistas consider overloading an operator for an unrelated purpose to be sinful blasphemy.

    [update]

    User OHLÁLÁ is not impressed:

    The problem with this solution is when you are trying to call the function instead of piping. – OHLÁLÁ

    You can implement the dunder-call method:

    def __call__(self, df):
        return df >> self
    

    And then:

    >>> select('one')(df)
       one
    0  1.0
    1  2.0
    2  3.0
    3  4.0
    4  4.0
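
    For context, the `__call__` method above goes inside the `PipeInto` class created by the decorator. A self-contained sketch of the combined result, using a plain list in place of a DataFrame (the `head` function is hypothetical, and this variant keeps the arguments on the instance):

```python
def pipe(original):
    class PipeInto:
        def __init__(self, *args, **kwargs):
            self.args = args
            self.kwargs = kwargs

        def __rrshift__(self, other):
            # Piping form: data >> stage
            return original(other, *self.args, **self.kwargs)

        def __call__(self, data):
            # Call form: stage(data) simply reuses the piping form.
            return data >> self

    return PipeInto


@pipe
def head(items, n):
    # Hypothetical example function: first n items.
    return items[:n]


piped = [1, 2, 3, 4] >> head(2)
called = head(2)([1, 2, 3, 4])
```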
    

    Looks like it is not easy to please OHLÁLÁ:

    In that case you need to call the object explicitly: select('one')(df). Is there a way to avoid that? – OHLÁLÁ

    Well, I can think of a solution, but there is a caveat: your original function must not take a second positional argument that is a pandas DataFrame (keyword arguments are OK). Let's add a __new__ method to our PipeInto class inside the decorator that tests whether the first argument is a DataFrame; if it is, we just call the original function with the arguments:

    def __new__(cls, *args, **kwargs):
        if args and isinstance(args[0], pd.DataFrame):
            return cls.data['function'](*args, **kwargs)
        return super().__new__(cls)
    

    It seems to work, but there is probably some downside I was unable to spot.

    >>> select(df, 'one')
       one
    0  1.0
    1  2.0
    2  3.0
    3  4.0
    4  4.0
    
    >>> df >> select('one')
       one
    0  1.0
    1  2.0
    2  3.0
    3  4.0
    4  4.0
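
    The trick relies on a detail of object creation: when `__new__` returns something that is not an instance of the class, Python skips `__init__` and hands that object straight back to the caller. A minimal sketch of the same dispatch, with `list` standing in for `pd.DataFrame` (the `head` function is hypothetical):

```python
def pipe(original):
    class PipeInto:
        def __new__(cls, *args, **kwargs):
            # Direct call such as head(data, 2): run the function right away.
            # The result is not a PipeInto, so __init__ is never invoked.
            if args and isinstance(args[0], list):
                return original(*args, **kwargs)
            return super().__new__(cls)

        def __init__(self, *args, **kwargs):
            self.args = args
            self.kwargs = kwargs

        def __rrshift__(self, other):
            return original(other, *self.args, **self.kwargs)

    return PipeInto


@pipe
def head(items, n):
    # Hypothetical example function: first n items.
    return items[:n]


direct = head([1, 2, 3, 4], 2)   # dispatched by __new__
piped = [1, 2, 3, 4] >> head(2)  # dispatched by __rrshift__
```

    The caveat mentioned above shows up here too: a stage whose first configured argument is itself a list would be mistaken for a direct call.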
    
  • 2020-12-14 02:47

    You can use the sspipe library with the following syntax:

    from sspipe import p
    df = df | p(select, 'one') \
            | p(rename, one = 'new_one')
    