Get first row of dataframe in Python Pandas based on criteria

佛祖请我去吃肉 2020-12-02 20:29

Let's say that I have a dataframe like this one:

import pandas as pd
df = pd.DataFrame([[1, 2, 1], [1, 3, 2], [4, 6, 3], [4, 3, 4], [5, 4, 5]], columns=['A', 'B', 'C'])


        
4 Answers
  • 2020-12-02 20:37

    This tutorial is a very good one for pandas slicing. Make sure you check it out. Onto some snippets... To slice a dataframe with a condition, you use this format:

    >>> df[condition]
    

    This will return a slice of your dataframe which you can index using iloc. Here are your examples:

    1. Get first row where A > 3 (returns row 2)

      >>> df[df.A > 3].iloc[0]
      A    4
      B    6
      C    3
      Name: 2, dtype: int64
      

    If what you actually want is the row's index label rather than the row itself, use df[df.A > 3].index[0] instead of iloc.

    2. Get first row where A > 4 AND B > 3 (returns row 4)

      >>> df[(df.A > 4) & (df.B > 3)].iloc[0]
      A    5
      B    4
      C    5
      Name: 4, dtype: int64
      
    3. Get first row where A > 3 AND (B > 3 OR C > 2) (returns row 2)

      >>> df[(df.A > 3) & ((df.B > 3) | (df.C > 2))].iloc[0]
      A    4
      B    6
      C    3
      Name: 2, dtype: int64
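
    The difference between the matching row, its index label, and its position only shows up when the index is not the default 0..n-1. The sketch below uses a hypothetical string index (the letters are an assumption made for illustration, not part of the question) to show all three.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 1], [1, 3, 2], [4, 6, 3], [4, 3, 4], [5, 4, 5]],
    columns=["A", "B", "C"],
    index=list("vwxyz"),  # hypothetical non-default labels, for illustration only
)

mask = df.A > 3

first_row = df[mask].iloc[0]      # the first matching row, as a Series
first_label = df[mask].index[0]   # its index label
# 0-based position of the first True; only meaningful when mask.any() is True
first_position = int(mask.to_numpy().argmax())
```

    With the default integer index used in the question, the label and the position are the same number.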
      

    Now, for your last case, we can write a function that returns the first matching row and, when nothing matches, falls back to the first row of the frame sorted by a default column in descending order:

    >>> def series_or_default(X, condition, default_col, ascending=False):
    ...     sliced = X[condition]
    ...     if sliced.shape[0] == 0:
    ...         return X.sort_values(default_col, ascending=ascending).iloc[0]
    ...     return sliced.iloc[0]
    >>> 
    >>> series_or_default(df, df.A > 6, 'A')
    A    5
    B    4
    C    5
    Name: 4, dtype: int64
    

    As expected, it returns row 4.

  • 2020-12-02 20:38

    For existing matches, use query:

    df.query(' A > 3' ).head(1)
    Out[33]: 
       A  B  C
    2  4  6  3
    
    df.query(' A > 4 and B > 3' ).head(1)
    Out[34]: 
       A  B  C
    4  5  4  5
    
    df.query(' A > 3 and (B > 3 or C > 2)' ).head(1)
    Out[35]: 
       A  B  C
    2  4  6  3
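
    The query calls above only cover the case where a match exists. A minimal sketch of one way to add the fallback, assuming the desired default is the row with the largest A; the variable name threshold and its value are hypothetical:

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 1], [1, 3, 2], [4, 6, 3], [4, 3, 4], [5, 4, 5]],
    columns=["A", "B", "C"],
)

threshold = 6  # hypothetical value that no row satisfies

# "@name" lets a query string refer to a local Python variable
match = df.query("A > @threshold").head(1)
if match.empty:
    # Assumed fallback: the single row with the largest A
    match = df.nlargest(1, "A")
```

    Here nlargest stands in for the sort-then-head approach; both pick the row with the largest A.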
    
  • 2020-12-02 20:41

    You can take care of the first three cases with Boolean slicing and head:

    1. df[df.A > 3].head(1)
    2. df[(df.A > 4) & (df.B > 3)].head(1)
    3. df[(df.A > 3) & ((df.B > 3) | (df.C > 2))].head(1)

    The case where nothing comes back can be handled with a try or an if. The if version avoids a bare except:

    output = df[df.A > 6].head(1)
    if output.empty:
        # No row matched: fall back to the row with the largest A
        output = df.sort_values('A', ascending=False).head(1)
    
  • 2020-12-02 20:43

    If the requirement is to return as soon as the first matching row is found, without iterating over the remaining rows, the following code works:

    def pd_iter_func(df):
        for row in df.itertuples():
            # Define your criteria here
            if row.A > 4 and row.B > 3:
                return row
    

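    This can be faster than Boolean indexing on a large dataframe when a match occurs early, because the loop stops at the first hit. When the match is late or absent, the Python-level loop is usually slower than a vectorized mask.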

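    To make the function above more general, the criteria can be passed in as a callable, such as a lambda:

    from typing import Callable, NamedTuple, Optional
    from pandas import DataFrame

    def pd_iter_func(df: DataFrame, criteria: Callable[[NamedTuple], bool]) -> Optional[NamedTuple]:
        for row in df.itertuples():
            if criteria(row):
                return row
        return None  # no row met the criteria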
    
    pd_iter_func(df, lambda row: row.A > 4 and row.B > 3)
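
    A possible middle ground between a row-by-row loop and a full Boolean mask is to evaluate the mask one chunk at a time and stop at the first chunk that contains a match. This is only a sketch: the function name first_match_chunked, the make_mask parameter, and the default chunk size are assumptions for illustration, not something taken from pandas itself.

```python
from typing import Callable, Optional

import pandas as pd


def first_match_chunked(
    df: pd.DataFrame,
    make_mask: Callable[[pd.DataFrame], pd.Series],
    chunk_size: int = 100_000,
) -> Optional[pd.Series]:
    """Return the first row whose mask is True, scanning chunk by chunk."""
    for start in range(0, len(df), chunk_size):
        chunk = df.iloc[start:start + chunk_size]
        mask = make_mask(chunk)  # vectorized test on this chunk only
        if mask.any():
            # Stop here; later chunks are never evaluated
            return chunk[mask].iloc[0]
    return None  # no row matched


df = pd.DataFrame(
    [[1, 2, 1], [1, 3, 2], [4, 6, 3], [4, 3, 4], [5, 4, 5]],
    columns=["A", "B", "C"],
)
row = first_match_chunked(df, lambda d: (d.A > 4) & (d.B > 3), chunk_size=2)
```

    The tiny chunk_size is only there so the example frame spans several chunks; a suitable value for real data would have to be measured.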
    

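    As mentioned in the answer to the 'mirror' question, pandas.Series.idxmax is another good choice: on a Boolean mask it returns the index label of the first True value. Note that when no row matches, idxmax still returns the first label, so the function below would return the first row of the frame.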

    def pd_idxmax_func(df, mask):
        return df.loc[mask.idxmax()]
    
    pd_idxmax_func(df, (df.A > 4) & (df.B > 3))
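
    A guarded variant of the idxmax approach, as a sketch: it checks the mask before calling idxmax so that a frame with no matching row yields None rather than its first row. The name first_match_or_none is hypothetical, and the code assumes the index labels are unique.

```python
from typing import Optional

import pandas as pd


def first_match_or_none(df: pd.DataFrame, mask: pd.Series) -> Optional[pd.Series]:
    # idxmax returns the first label even when every value is False,
    # so test for a match explicitly before using it
    if not mask.any():
        return None
    # Assumes unique index labels, so .loc returns a single row
    return df.loc[mask.idxmax()]


df = pd.DataFrame(
    [[1, 2, 1], [1, 3, 2], [4, 6, 3], [4, 3, 4], [5, 4, 5]],
    columns=["A", "B", "C"],
)
hit = first_match_or_none(df, (df.A > 4) & (df.B > 3))
miss = first_match_or_none(df, df.A > 6)
```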
    