Get the same hash value for a Pandas DataFrame each time

My goal is to get a unique hash value for a DataFrame that I load from a .csv file. The whole point is to get the same hash each time I call hash() on it.

My idea was t

3 Answers

南旧

2020-12-13 18:36

I had a similar problem: checking whether a DataFrame has changed. I solved it by hashing the msgpack serialization of the DataFrame, which seems stable across repeated loads of the same data.

import pandas as pd
import hashlib
DATA_FILE = 'data.json'

data1 = pd.read_json(DATA_FILE)
data2 = pd.read_json(DATA_FILE)

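# The msgpack serialization is identical for both loads, so the digests match.
# Note: DataFrame.to_msgpack() was deprecated in pandas 0.25 and removed in 1.0.
assert hashlib.md5(data1.to_msgpack()).hexdigest() == hashlib.md5(data2.to_msgpack()).hexdigest()
# For object-dtype data, .values.tobytes() dumps the raw buffer (object pointers),
# so the digests differ between loads even though the contents are equal.
assert hashlib.md5(data1.values.tobytes()).hexdigest() != hashlib.md5(data2.values.tobytes()).hexdigest()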
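
Since to_msgpack is not available in current pandas releases, the same idea (hash a stable serialization rather than the raw memory) can be sketched with another serializer. The example below is only an illustration, not the answer's original method: it assumes a CSV text dump is an acceptable serialization, and the names CSV_TEXT and frame_digest are made up for the example.

```python
import hashlib
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for the data file.
CSV_TEXT = "name,value\nfoo,1\nbar,2\n"


def frame_digest(df: pd.DataFrame) -> str:
    """Hash a CSV serialization of the frame (hypothetical helper)."""
    payload = df.to_csv(index=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


data1 = pd.read_csv(io.StringIO(CSV_TEXT))
data2 = pd.read_csv(io.StringIO(CSV_TEXT))

# Two independent loads of the same data give the same digest.
assert frame_digest(data1) == frame_digest(data2)

# Changing a single cell changes the digest.
data2.loc[0, "value"] = 99
assert frame_digest(data1) != frame_digest(data2)
```

The digest here depends on how to_csv formats values, so it is best compared only between runs that use the same pandas version and platform.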


北荒

2020-12-13 18:42

As of pandas 0.20.1, you can use the little-known (and poorly documented) hash_pandas_object, which was recently made public in pandas.util. It returns one hash value for each row of the DataFrame (and works on Series etc. too).

import pandas as pd
import numpy as np

np.random.seed(42)
arr = np.random.choice(['foo', 'bar', 42], size=(3,4))
df = pd.DataFrame(arr)

print(df)
#      0    1   2    3
# 0   42  foo  42   42
# 1  foo  foo  42  bar
# 2   42   42  42   42

from pandas.util import hash_pandas_object
h = hash_pandas_object(df)

print(h)
# 0     5559921529589760079
# 1    16825627446701693880
# 2     7171023939017372657
# dtype: uint64

You can always do hash_pandas_object(df).sum() if you want an overall hash of all rows.
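
One caveat with .sum() is that it does not depend on row order. If row order should matter, a possible variant is to feed the per-row hashes into hashlib. This is a minimal sketch under that assumption; frame_fingerprint is a hypothetical helper name, not part of pandas.

```python
import hashlib

import pandas as pd
from pandas.util import hash_pandas_object


def frame_fingerprint(df: pd.DataFrame) -> str:
    """Combine the per-row hashes into one hex digest (hypothetical helper)."""
    row_hashes = hash_pandas_object(df, index=True)  # one uint64 per row
    return hashlib.sha256(row_hashes.values.tobytes()).hexdigest()


df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
same = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
reordered = df.iloc[::-1]  # same rows and index labels, reversed order

# Equal frames give equal fingerprints.
assert frame_fingerprint(df) == frame_fingerprint(same)

# The fingerprint changes when the rows are reordered...
assert frame_fingerprint(df) != frame_fingerprint(reordered)

# ...whereas the plain sum of row hashes does not.
assert hash_pandas_object(df).sum() == hash_pandas_object(reordered).sum()
```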


执笔经年

2020-12-13 19:03
Joblib provides a hashing function optimized for objects containing numpy arrays (e.g. pandas dataframes).
```
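import joblib
import pandas as pd
df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})  # small example frame
joblib.hash(df)  # returns a hex digest string (MD5 by default)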
```