Pandas .loc without KeyError

南方客 2020-12-17 23:57
>>> pd.DataFrame([1], index=['1']).loc['2']  # KeyError
>>> pd.DataFrame([1], index=['1']).loc[['2']]  # KeyError

5 answers
  • 2020-12-18 00:01

    It seems to work fine for me. I'm running Python 3.5 with pandas version 0.20.3.

    import numpy as np
    import pandas as pd
    
    # Create dataframe
    data = {'distance': [0, 300, 600, 1000],
            'population': [4.8, 0.7, 6.4, 2.9]}
    df = pd.DataFrame(data, index=['Alabama','Alaska','Arizona','Arkansas'])
    
    keys = ['Alabama', 'Alaska', 'Arizona', 'Virginia']
    
    # Create a subset of the dataframe.
    df.loc[keys]
              distance  population
    Alabama        0.0         4.8
    Alaska       300.0         0.7
    Arizona      600.0         6.4
    Virginia       NaN         NaN
    

    Or if you want to exclude the NaN row:

    df.loc[keys].dropna()
              distance  population
    Alabama        0.0         4.8
    Alaska       300.0         0.7
    Arizona      600.0         6.4
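
    Note that this behavior changed in later pandas releases: since pandas 1.0, passing a list containing any missing label to `.loc` raises a `KeyError` outright, so the NaN-filling shown above no longer happens. A small sketch of the current behavior, using `reindex` as the replacement (same sample data as above):

```python
import pandas as pd

data = {'distance': [0, 300, 600, 1000],
        'population': [4.8, 0.7, 6.4, 2.9]}
df = pd.DataFrame(data, index=['Alabama', 'Alaska', 'Arizona', 'Arkansas'])
keys = ['Alabama', 'Alaska', 'Arizona', 'Virginia']

# On pandas >= 1.0, df.loc[keys] raises KeyError because 'Virginia' is absent;
# reindex reproduces the old NaN-filling behavior instead:
subset = df.reindex(keys)
print(subset)
```

    The `Virginia` row comes back as all-NaN, exactly like the pandas 0.20 output shown above.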
    
  • 2020-12-18 00:10

    Using the sample dataframe from @binjip's answer:

    import numpy as np
    import pandas as pd
    
    # Create dataframe
    data = {'distance': [0, 300, 600, 1000],
            'population': [4.8, 0.7, 6.4, 2.9]}
    df = pd.DataFrame(data, index=['Alabama','Alaska','Arizona','Arkansas'])
    
    keys = ['Alabama', 'Alaska', 'Arizona', 'Virginia']
    
    

    Get matching records from the dataframe. NB: The dataframe index must be unique for this to work!

    df.reindex(keys)
    
              distance  population
    Alabama        0.0         4.8
    Alaska       300.0         0.7
    Arizona      600.0         6.4
    Virginia       NaN         NaN
    

    If you want to omit missing keys:

    df.reindex(df.index.intersection(keys))
    
             distance  population
    Alabama         0         4.8
    Alaska        300         0.7
    Arizona       600         6.4
    
  • 2020-12-18 00:16

    Update for @AlexLenail's comment
    It's a fair point that this will be slow for large lists. I did a bit more digging and found that the intersection method is available on both indexes and columns. I'm not sure about the algorithmic complexity, but it's much faster empirically.

    You can do something like this.

    good_keys = df.index.intersection(all_keys)
    df.loc[good_keys]
    

    Or like your example

    df = pd.DataFrame([1], index=['1'])
    df.loc[df.index.intersection(['2'])]
    

    Here is a little experiment:

    import numpy as np
    import pandas as pd
    
    n = 100000
    
    # Create random values and random string indexes
    # have the bad indexes contain extra values not in DataFrame Index
    rand_val = np.random.rand(n)
    rand_idx = []
    for x in range(n):
        rand_idx.append(str(x))
    
    bad_idx = []
    for x in range(n*2):
        bad_idx.append(str(x))
    
    df = pd.DataFrame(rand_val, index=rand_idx)
    df.head()
    
    def get_valid_keys_list_comp():
        # Return filtered DataFrame using list comprehension to filter keys
        vkeys = [key for key in bad_idx if key in df.index.values]
        return df.loc[vkeys]
    
    def get_valid_keys_intersection():
        # Return filtered DataFrame using list intersection() to filter keys
        vkeys = df.index.intersection(bad_idx)
        return df.loc[vkeys]
    
    %%timeit 
    get_valid_keys_intersection()
    # 64.5 ms ± 4.53 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
    
    %%timeit 
    get_valid_keys_list_comp()
    # 6.14 s ± 457 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    

    Original answer

    I'm not sure if pandas has a built-in function to handle this, but you can use a Python list comprehension to filter down to valid indexes, with something like this.

    Given a DataFrame df2

               A    B       C   D    F
    test    1.0 2013-01-02  1.0 3   foo
    train   1.0 2013-01-02  1.0 3   foo
    test    1.0 2013-01-02  1.0 3   foo
    train   1.0 2013-01-02  1.0 3   foo
    

    You can filter your index query with this

    keys = ['test', 'train', 'try', 'fake', 'broken']
    valid_keys = [key for key in keys if key in df2.index.values]
    df2.loc[valid_keys]
    

    This will also work for columns if you use df2.columns instead of df2.index.values
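
    For example, the same filtering trick applied to columns might look like this (a minimal sketch; the frame and column names are illustrative, loosely based on the df2 above):

```python
import pandas as pd

# Toy frame standing in for df2; column names are illustrative
df2 = pd.DataFrame({'A': [1.0, 1.0], 'C': [3, 3], 'F': ['foo', 'foo']},
                   index=['test', 'train'])

wanted = ['A', 'F', 'no_such_column']
valid_cols = [c for c in wanted if c in df2.columns]
result = df2[valid_cols]   # keeps only A and F, silently skipping the missing name
```

    As with the index version, `df2.columns.intersection(wanted)` does the same job and scales better for long lists.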

  • 2020-12-18 00:22

    This page https://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike has the solution:

    In [8]: pd.DataFrame([1], index=['1']).reindex(['2'])
    Out[8]: 
         0
    2  NaN

  • 2020-12-18 00:24

    I found an alternative (provided a check for df.empty is made beforehand). You could do something like this:

    df[df.index == '2']

    This returns either a DataFrame with the matched rows or an empty DataFrame.
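
    A minimal sketch of this approach, using the frame from the question — the boolean mask never raises, so the caller only needs to check `.empty`:

```python
import pandas as pd

df = pd.DataFrame([1], index=['1'])

hit = df[df.index == '1']    # DataFrame containing the matching row
miss = df[df.index == '2']   # empty DataFrame rather than a KeyError

print(miss.empty)            # check this before using the result
```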
