How to “select distinct” across multiple data frame columns in pandas?

前端 未结 6 484
渐次进展
渐次进展 2020-12-02 14:56

I\'m looking for a way to do the equivalent to the SQL

SELECT DISTINCT col1, col2 FROM dataframe_table

The pandas sql comparison doesn\'t

相关标签:
6条回答
  • 2020-12-02 15:27

    I think use drop duplicate sometimes will not so useful depending dataframe.

    I found this:

    [in] df['col_1'].unique()
    [out] array(['A', 'B', 'C'], dtype=object)
    

    And work for me!

    https://riptutorial.com/pandas/example/26077/select-distinct-rows-across-dataframe

    0 讨论(0)
  • 2020-12-02 15:28

    You can use the drop_duplicates method to get the unique rows in a DataFrame:

    In [29]: df = pd.DataFrame({'a':[1,2,1,2], 'b':[3,4,3,5]})
    
    In [30]: df
    Out[30]:
       a  b
    0  1  3
    1  2  4
    2  1  3
    3  2  5
    
    In [32]: df.drop_duplicates()
    Out[32]:
       a  b
    0  1  3
    1  2  4
    3  2  5
    

    You can also provide the subset keyword argument if you only want to use certain columns to determine uniqueness. See the docstring.

    0 讨论(0)
  • 2020-12-02 15:28

    I've tried different solutions. First was:

    a_df=np.unique(df[['col1','col2']], axis=0)
    

    and it works well for not object data Another way to do this and to avoid error (for object columns type) is to apply drop_duplicates()

    a_df=df.drop_duplicates(['col1','col2'])[['col1','col2']]
    

    You can also use SQL to do this, but it worked very slow in my case:

    from pandasql import sqldf
    q="""SELECT DISTINCT col1, col2 FROM df;"""
    pysqldf = lambda q: sqldf(q, globals())
    a_df = pysqldf(q)
    
    0 讨论(0)
  • 2020-12-02 15:40

    To solve a similar problem, I'm using groupby:

    print(f"Distinct entries: {len(df.groupby(['col1', 'col2']))}")
    

    Whether that's appropriate will depend on what you want to do with the result, though (in my case, I just wanted the equivalent of COUNT DISTINCT as shown).

    0 讨论(0)
  • 2020-12-02 15:40

    You can take the sets of the columns and just subtract the smaller set from the larger set:

    distinct_values = set(df['a'])-set(df['b'])
    
    0 讨论(0)
  • 2020-12-02 15:49

    There is no unique method for a df, if the number of unique values for each column were the same then the following would work: df.apply(pd.Series.unique) but if not then you will get an error. Another approach would be to store the values in a dict which is keyed on the column name:

    In [111]:
    df = pd.DataFrame({'a':[0,1,2,2,4], 'b':[1,1,1,2,2]})
    d={}
    for col in df:
        d[col] = df[col].unique()
    d
    
    Out[111]:
    {'a': array([0, 1, 2, 4], dtype=int64), 'b': array([1, 2], dtype=int64)}
    
    0 讨论(0)
提交回复
热议问题