pyspark generate row hash of specific columns and add it as a new column

Submitted by 拈花ヽ惹草 on 2019-12-01 08:26:05

You can use pyspark.sql.functions.concat_ws() to concatenate your columns and pyspark.sql.functions.sha2() to get the SHA256 hash.

Using the data from @gaw:

from pyspark.sql.functions import sha2, concat_ws
df = spark.createDataFrame(
    [(1,"2",5,1),(3,"4",7,8)],
    ("col1","col2","col3","col4")
)
df.withColumn("row_sha2", sha2(concat_ws("||", *df.columns), 256)).show(truncate=False)
#+----+----+----+----+----------------------------------------------------------------+
#|col1|col2|col3|col4|row_sha2                                                        |
#+----+----+----+----+----------------------------------------------------------------+
#|1   |2   |5   |1   |1b0ae4beb8ce031cf585e9bb79df7d32c3b93c8c73c27d8f2c2ddc2de9c8edcd|
#|3   |4   |7   |8   |57f057bdc4178b69b1b6ab9d78eabee47133790cba8cf503ac1658fa7a496db1|
#+----+----+----+----+----------------------------------------------------------------+

To get SHA-256 you can pass in either 0 or 256 as the second argument to sha2(), as per the docs:

Returns the hex string result of SHA-2 family of hash functions (SHA-224, SHA-256, SHA-384, and SHA-512). The numBits indicates the desired bit length of the result, which must have a value of 224, 256, 384, 512, or 0 (which is equivalent to 256).
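As a rough illustration of what numBits controls, here is a small sketch that uses Python's standard hashlib rather than Spark. The names SHA2_VARIANTS and hex_length are made up for this example; the point is only that each variant yields a hex string of numBits / 4 characters.

```python
import hashlib

# Map each numBits value to the matching hashlib constructor.
SHA2_VARIANTS = {
    224: hashlib.sha224,
    256: hashlib.sha256,
    384: hashlib.sha384,
    512: hashlib.sha512,
}

def hex_length(num_bits, data=b"1||2||5||1"):
    """Return the number of hex characters in the digest for this variant."""
    return len(SHA2_VARIANTS[num_bits](data).hexdigest())

for bits in sorted(SHA2_VARIANTS):
    # Each hex character encodes 4 bits, so the length is bits / 4.
    print(bits, hex_length(bits))
```

This is why the row_sha2 column above is 64 characters wide: 256 bits / 4 bits per hex character.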

The function concat_ws takes in a separator, and a list of columns to join. I am passing in || as the separator and df.columns as the list of columns.
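To see why a separator is worth having, here is a minimal plain-Python sketch (hashlib only, no Spark). The helper sha256_hex and the two sample rows are assumptions made up for the illustration: without a separator, different rows can collapse to the same string and therefore the same hash.

```python
import hashlib

def sha256_hex(text):
    """SHA-256 of a string as a hex string (the text is UTF-8 encoded first)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

row_a = ["1", "23"]
row_b = ["12", "3"]

# Without a separator both rows collapse to "123" and hash identically.
no_sep_equal = sha256_hex("".join(row_a)) == sha256_hex("".join(row_b))

# With "||" the strings are "1||23" and "12||3", so the hashes differ.
with_sep_equal = sha256_hex("||".join(row_a)) == sha256_hex("||".join(row_b))

print(no_sep_equal, with_sep_equal)  # True False
```

A separator does not remove every ambiguity (a value that itself contains || can still collide), so pick one that is unlikely to appear in your data.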

I am using all of the columns here, but you can specify whatever subset of columns you'd like; in your case that would be columnarray. (You need to use the * to unpack the list.)
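The effect of hashing only a subset can be sketched in plain Python as well. The row_hash helper and the particular columnarray below are hypothetical; in PySpark itself the equivalent expression would be sha2(concat_ws("||", *columnarray), 256).

```python
import hashlib

def row_hash(row, columns, sep="||"):
    """Hash only the chosen columns of a dict-like row, in the order given."""
    joined = sep.join(str(row[c]) for c in columns)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

row = {"col1": 1, "col2": "2", "col3": 5, "col4": 1}
columnarray = ["col1", "col3"]  # hypothetical subset, for illustration only

subset_hash = row_hash(row, columnarray)  # hashes the string "1||5"
full_hash = row_hash(row, list(row))      # hashes the string "1||2||5||1"

print(subset_hash != full_hash)  # True
```

One difference to keep in mind: this sketch turns None into the string "None", whereas concat_ws skips null values, so the two approaches will not agree on rows containing nulls.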

gaw

If you want a hash built from the values of the different columns in each row, you can apply a custom function via map to the RDD of your DataFrame.

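import hashlib
from pyspark.sql import Row

test_df = spark.createDataFrame([
    (1,"2",5,1),(3,"4",7,8),
    ], ("col1","col2","col3","col4"))

def sha_concat(row):
    row_dict = row.asDict()                             #transform row to a dict
    # join the values in the Row's own column order;
    # iterating over a dict does not guarantee that order on older Python versions
    concat_str = '||'.join(str(v) for v in row)         #concatenate values
    row_dict["sha_values"] = concat_str                 #preserve concatenated value for testing (this can be removed later)
    # sha256 needs bytes on Python 3, so encode the string first
    row_dict["sha_hash"] = hashlib.sha256(concat_str.encode("utf-8")).hexdigest() #calculate sha256
    return Row(**row_dict)

test_df.rdd.map(sha_concat).toDF().show(truncate=False)

The results below are from the original run, which concatenated the dict's values directly. There the second row came out as 8||4||7||3 rather than in column order, presumably because the dict did not preserve column order, so its hash differs from the concat_ws result for the same row: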

+----+----+----+----+----------------------------------------------------------------+----------+
|col1|col2|col3|col4|sha_hash                                                        |sha_values|
+----+----+----+----+----------------------------------------------------------------+----------+
|1   |2   |5   |1   |1b0ae4beb8ce031cf585e9bb79df7d32c3b93c8c73c27d8f2c2ddc2de9c8edcd|1||2||5||1|
|3   |4   |7   |8   |cb8f8c5d9fd7165cf3c0f019e0fb10fa0e8f147960c715b7f6a60e149d3923a5|8||4||7||3|
+----+----+----+----+----------------------------------------------------------------+----------+
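The second row above shows sha_values as 8||4||7||3, not 3||4||7||8, which is why its hash does not match the one produced with concat_ws for the same data. A small plain-Python sketch (hashlib only; the sha256_hex helper is an assumption for illustration) shows that the hash depends on the order in which the values are joined:

```python
import hashlib

def sha256_hex(text):
    """SHA-256 of a string as a hex string (the text is UTF-8 encoded first)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

values = {"col1": 3, "col2": "4", "col3": 7, "col4": 8}

# Column order, as concat_ws("||", *df.columns) would join them.
in_column_order = "||".join(str(values[c]) for c in ["col1", "col2", "col3", "col4"])

# The order that appears in sha_values for the second row above.
as_in_output = "||".join(str(values[c]) for c in ["col4", "col2", "col3", "col1"])

print(in_column_order, as_in_output)  # 3||4||7||8 8||4||7||3
print(sha256_hex(in_column_order) == sha256_hex(as_in_output))  # False
```

If the hashes need to be comparable across runs or across the two approaches, fix the column order explicitly rather than relying on whatever order a dict yields.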