Get the same hash value for a Pandas DataFrame each time

执念已碎 2020-12-13 18:11

My goal is to get a unique hash value for a DataFrame that I read from a .csv file. The whole point is to get the same hash each time I call hash() on it.

My idea was t

3 answers
  •  北荒
    北荒 (OP)
    2020-12-13 18:42

    As of Pandas 0.20.1, you can use the little-known (and poorly documented) hash_pandas_object, which was recently made public in pandas.util. It returns one hash value for each row of the DataFrame (and works on Series etc. too).

    import pandas as pd
    import numpy as np
    
    np.random.seed(42)
    arr = np.random.choice(['foo', 'bar', 42], size=(3,4))
    df = pd.DataFrame(arr)
    
    print(df)
    #      0    1   2    3
    # 0   42  foo  42   42
    # 1  foo  foo  42  bar
    # 2   42   42  42   42
    
    from pandas.util import hash_pandas_object
    h = hash_pandas_object(df)
    
    print(h)
    # 0     5559921529589760079
    # 1    16825627446701693880
    # 2     7171023939017372657
    # dtype: uint64
    

    You can always do hash_pandas_object(df).sum() if you want an overall hash of all rows.
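
    Note that a sum does not depend on row order, so two frames with the same rows in a different order get the same overall value. If order should matter, one possible alternative is to feed the per-row hashes into a cryptographic digest. The sketch below assumes that approach; dataframe_digest is a hypothetical helper name, not part of pandas, and SHA-256 is just one reasonable choice of digest.

```python
import hashlib

import pandas as pd
from pandas.util import hash_pandas_object


def dataframe_digest(df: pd.DataFrame) -> str:
    """Hypothetical helper: one hex digest for the whole DataFrame."""
    # One uint64 per row; index=True (the default) folds the index into each row hash.
    row_hashes = hash_pandas_object(df, index=True)
    # Hash the raw bytes of the row hashes, so the order of the rows matters.
    return hashlib.sha256(row_hashes.to_numpy().tobytes()).hexdigest()


df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
same = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
reordered = df.iloc[::-1]  # same rows, reversed order

print(dataframe_digest(df) == dataframe_digest(same))
print(dataframe_digest(df) == dataframe_digest(reordered))
```

    The first comparison is between two frames built from identical data, the second between a frame and its row-reversed copy. The digest is only as stable as the row hashes themselves, so it is worth checking that the dtypes you get when reading the .csv file are the same on every run.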
