My goal is to get a unique hash value for a DataFrame that I obtain from a .csv file. The whole point is to get the same hash each time I hash the same data.
I had a similar problem — checking whether a DataFrame had changed — and I solved it by hashing the msgpack serialization bytes. This proved stable across repeated reloads of the same data. (Note that to_msgpack was deprecated in pandas 0.25 and removed in 1.0, so this approach only works on older versions.)
import pandas as pd
import hashlib

DATA_FILE = 'data.json'

data1 = pd.read_json(DATA_FILE)
data2 = pd.read_json(DATA_FILE)

# Hashing the serialized bytes is stable across reloads of the same data
assert hashlib.md5(data1.to_msgpack()).hexdigest() == hashlib.md5(data2.to_msgpack()).hexdigest()

# Hashing the raw values is NOT stable: object-dtype arrays store Python
# object pointers, which differ between loads
assert hashlib.md5(data1.values.tobytes()).hexdigest() != hashlib.md5(data2.values.tobytes()).hexdigest()
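On current pandas versions, where to_msgpack is gone, the same idea works with any deterministic serialization. A minimal sketch using the CSV representation (the inline CSV string and the df_hash helper are my own illustration, not from the answer above):

```python
import hashlib
import io
import pandas as pd

# Hypothetical inline data standing in for a file on disk
CSV = "a,b\n1,foo\n2,bar\n"

def df_hash(df: pd.DataFrame) -> str:
    """Digest of the DataFrame's CSV serialization; stable across reloads of the same data."""
    return hashlib.sha256(df.to_csv().encode("utf-8")).hexdigest()

# Two independent loads of identical data produce identical digests
df1 = pd.read_csv(io.StringIO(CSV))
df2 = pd.read_csv(io.StringIO(CSV))
assert df_hash(df1) == df_hash(df2)
```

Serializing to text sidesteps the object-pointer problem, at the cost of re-serializing the whole frame on every hash.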
As of pandas 0.20.1, you can use the little-known (and poorly documented) hash_pandas_object
(source code), which was recently made public in pandas.util
. It returns one hash value for each row of the DataFrame (and works on Series etc. too).
import pandas as pd
import numpy as np
np.random.seed(42)
arr = np.random.choice(['foo', 'bar', 42], size=(3,4))
df = pd.DataFrame(arr)
print(df)
# 0 1 2 3
# 0 42 foo 42 42
# 1 foo foo 42 bar
# 2 42 42 42 42
from pandas.util import hash_pandas_object
h = hash_pandas_object(df)
print(h)
# 0 5559921529589760079
# 1 16825627446701693880
# 2 7171023939017372657
# dtype: uint64
You can always do hash_pandas_object(df).sum()
if you want a single overall hash of all rows. Note that summing is order-insensitive, so two DataFrames with the same rows in a different order get the same value.
Joblib provides a hashing function optimized for objects containing NumPy arrays (e.g. pandas DataFrames).
import joblib
joblib.hash(df)