Get the same hash value for a Pandas DataFrame each time

执念已碎 2020-12-13 18:11

My goal is to get a unique hash value for a DataFrame that I obtain from a .csv file. The whole point is to get the same hash each time I call hash() on it.

My idea was t

3 Answers
  • 2020-12-13 18:36

    I had a similar problem: I needed to check whether a DataFrame had changed, and I solved it by hashing its msgpack serialization. This seems stable across different reloads of the same data.

    import pandas as pd
    import hashlib
    DATA_FILE = 'data.json'
    
    data1 = pd.read_json(DATA_FILE)
    data2 = pd.read_json(DATA_FILE)
    
    # The msgpack-based hash is stable across reloads of the same data ...
    assert hashlib.md5(data1.to_msgpack()).hexdigest() == hashlib.md5(data2.to_msgpack()).hexdigest()
    # ... whereas hashing the raw values buffer can differ between loads
    # (object columns store pointers, not the string contents).
    assert hashlib.md5(data1.values.tobytes()).hexdigest() != hashlib.md5(data2.values.tobytes()).hexdigest()
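
    Note: to_msgpack was deprecated in pandas 0.25 and removed in pandas 1.0, so on current versions the same approach needs a different deterministic serialization. A minimal sketch of the same idea, assuming the same data.json file and using to_json as a stand-in:

    import pandas as pd
    import hashlib
    DATA_FILE = 'data.json'

    data1 = pd.read_json(DATA_FILE)
    data2 = pd.read_json(DATA_FILE)

    # Hash the text serialization instead of the removed msgpack one.
    digest1 = hashlib.md5(data1.to_json().encode('utf-8')).hexdigest()
    digest2 = hashlib.md5(data2.to_json().encode('utf-8')).hexdigest()
    assert digest1 == digest2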
    
  • 2020-12-13 18:42

    As of Pandas 0.20.1, you can use the little-known (and poorly documented) hash_pandas_object (source code), which was recently made public in pandas.util. It returns one hash value for each row of the DataFrame (and works on Series etc. too).

    import pandas as pd
    import numpy as np
    
    np.random.seed(42)
    arr = np.random.choice(['foo', 'bar', 42], size=(3,4))
    df = pd.DataFrame(arr)
    
    print(df)
    #      0    1   2    3
    # 0   42  foo  42   42
    # 1  foo  foo  42  bar
    # 2   42   42  42   42
    
    from pandas.util import hash_pandas_object
    h = hash_pandas_object(df)
    
    print(h)
    # 0     5559921529589760079
    # 1    16825627446701693880
    # 2     7171023939017372657
    # dtype: uint64
    

    You can always do hash_pandas_object(df).sum() if you want an overall hash of all rows.
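
    If you want a single stable digest rather than a plain sum (which could in principle collide), one common pattern, not specific to this answer, is to hash the raw bytes of the per-row hash Series, reusing the df built above:

    import hashlib
    from pandas.util import hash_pandas_object

    # hash_pandas_object returns a uint64 Series; hashing its raw bytes
    # yields one stable hex digest for the whole DataFrame.
    digest = hashlib.sha256(hash_pandas_object(df).values.tobytes()).hexdigest()
    print(digest)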

  • 2020-12-13 19:03

    Joblib provides a hashing function optimized for objects containing NumPy arrays (e.g. pandas DataFrames).

    import joblib
    joblib.hash(df)
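
    A minimal usage sketch (data.csv is a placeholder for your own file): two independent reads of the same CSV should hash to the same value.

    import pandas as pd
    import joblib

    # Two independent loads of the same file produce equal content,
    # so joblib.hash returns the same digest for both.
    df1 = pd.read_csv('data.csv')
    df2 = pd.read_csv('data.csv')
    assert joblib.hash(df1) == joblib.hash(df2)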
    