Question
Greetings, fellow programmers.
I have recently started with PySpark, coming from a pandas background. I need to compute the similarity of every user in a dataset against every other user. As I couldn't find a way to do this in PySpark, I resorted to using a Python dictionary to build the similarity data.
However, I have run out of ideas for converting a nested dictionary into a PySpark DataFrame. Could you please point me in a direction to achieve the desired result?
from pyspark.sql import SparkSession
from scipy.spatial import distance

spark = SparkSession.builder.getOrCreate()

traindf = spark.createDataFrame([
    ('u11', [1, 2, 3]),
    ('u12', [4, 5, 6]),
    ('u13', [7, 8, 9])
]).toDF("user", "rating")
traindf.show()
Output:
+----+---------+
|user| rating|
+----+---------+
| u11|[1, 2, 3]|
| u12|[4, 5, 6]|
| u13|[7, 8, 9]|
+----+---------+
I want to generate the similarity between each pair of users and put it into a PySpark DataFrame.
parent_dict = {}
for parent_row in traindf.collect():
    child_dict = {}
    for child_row in traindf.collect():
        # scipy's distance.cosine returns 1 - cosine similarity
        similarity = distance.cosine(parent_row['rating'], child_row['rating'])
        child_dict[child_row['user']] = similarity
    parent_dict[parent_row['user']] = child_dict
print(parent_dict)
Output:
{'u11': {'u11': 0.0, 'u12': 0.0253681538029239, 'u13': 0.0405880544333298},
'u12': {'u11': 0.0253681538029239, 'u12': 0.0, 'u13': 0.001809107314273195},
'u13': {'u11': 0.0405880544333298, 'u12': 0.001809107314273195, 'u13': 0.0}}
From this dictionary, I want to construct a PySpark DataFrame:
+-----+-----+--------------------+
|user1|user2| similarity|
+-----+-----+--------------------+
| u11| u11| 0.0|
| u11| u12| 0.0253681538029239|
| u11| u13| 0.0405880544333298|
| u12| u11| 0.0253681538029239|
| u12| u12| 0.0|
| u12| u13|0.001809107314273195|
| u13| u11| 0.0405880544333298|
| u13| u12|0.001809107314273195|
| u13| u13| 0.0|
+-----+-----+--------------------+
What I have tried so far is to convert the dict to a pandas DataFrame and then convert that into a PySpark DataFrame. However, I need to do this at a huge scale, so I am looking for a more Spark-ish way of doing it.
import pandas as pd

parent_user = []
child_user = []
child_similarity = []
for parent_row in traindf.collect():
    for child_row in traindf.collect():
        similarity = distance.cosine(parent_row['rating'], child_row['rating'])
        parent_user.append(parent_row['user'])
        child_user.append(child_row['user'])
        child_similarity.append(similarity)

my_dict = {'user1': parent_user, 'user2': child_user, 'similarity': child_similarity}

df = spark.createDataFrame(pd.DataFrame(my_dict))
df.show()
Output:
+-----+-----+--------------------+
|user1|user2| similarity|
+-----+-----+--------------------+
| u11| u11| 0.0|
| u11| u12| 0.0253681538029239|
| u11| u13| 0.0405880544333298|
| u12| u11| 0.0253681538029239|
| u12| u12| 0.0|
| u12| u13|0.001809107314273195|
| u13| u11| 0.0405880544333298|
| u13| u12|0.001809107314273195|
| u13| u13| 0.0|
+-----+-----+--------------------+
Answer 1:
Maybe you can do something like this:
import pandas as pd
from pyspark.sql import SQLContext

my_dic = {'u11': {'u11': 0.0, 'u12': 0.0253681538029239, 'u13': 0.0405880544333298},
          'u12': {'u11': 0.0253681538029239, 'u12': 0.0, 'u13': 0.001809107314273195},
          'u13': {'u11': 0.0405880544333298, 'u12': 0.001809107314273195, 'u13': 0.0}}

# unstack() flattens the nested dict into (user1, user2, value) rows
df = pd.DataFrame.from_dict(my_dic).unstack().to_frame().reset_index()
df.columns = ['user1', 'user2', 'similarity']

sqlCtx = SQLContext(sc)  # sc is the SparkContext
sqlCtx.createDataFrame(df).show()
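As a side note, SQLContext is deprecated in recent PySpark releases; the SparkSession from the question can create the DataFrame directly (a minimal sketch using the pandas df built above, not part of the original answer):

# equivalent without the deprecated SQLContext, reusing the question's SparkSession
spark.createDataFrame(df).show()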
Answer 2:
OK, now your question is clearer. I am assuming you are starting with a Spark DataFrame of (user, rating). What you want to do is cross-join this DF with itself; this creates a cross product with all possible pairs of users (and their ratings), including rows where a user is paired with itself (those can be filtered out later). You can then compute a new column that contains the similarity; a sketch of this idea follows.
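A minimal sketch of that idea, assuming the traindf from the question and reusing scipy's distance.cosine for the similarity column (the aliases left/right and the name cosine_udf are illustrative, not from the answer):

import pyspark.sql.functions as F
from pyspark.sql.types import DoubleType
from scipy.spatial import distance

# wrap scipy's cosine distance (1 - cosine similarity) in a UDF
cosine_udf = F.udf(lambda a, b: float(distance.cosine(a, b)), DoubleType())

pairs = (
    traindf.alias("left")
    .crossJoin(traindf.alias("right"))  # every (user, user) pair, self-pairs included
    .select(
        F.col("left.user").alias("user1"),
        F.col("right.user").alias("user2"),
        cosine_udf("left.rating", "right.rating").alias("similarity"),
    )
)
pairs.show()

Self-pairs can be dropped afterwards with a filter such as pairs.filter("user1 != user2").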
Answer 3:
import numpy as np
import pyspark.sql.functions as psf
from pyspark.sql.types import FloatType

def cos_sim(a, b):
    # cosine similarity (not distance): dot(a, b) / (|a| * |b|)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

dot_udf = psf.udf(lambda x, y: cos_sim(x, y), FloatType())

# data is the (user, rating) DataFrame, i.e. traindf from the question
data.alias("i").join(data.alias("j"), psf.col("i.user") != psf.col("j.user"))\
    .select(
        psf.col("i.user").alias("user1"),
        psf.col("j.user").alias("user2"),
        dot_udf("i.rating", "j.rating").alias("similarity"))\
    .sort("similarity")\
    .show()
The output is as desired (note that this UDF computes cosine similarity, whereas scipy's distance.cosine used in the question returns 1 - cosine similarity):
+-----+-----+----------+
|user1|user2|similarity|
+-----+-----+----------+
| u11| u12|0.70710677|
| u13| u11|0.70710677|
| u11| u13|0.70710677|
| u12| u11|0.70710677|
| u12| u13| 1.0|
| u13| u12| 1.0|
+-----+-----+----------+
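If the numbers should match the question's desired output (which used scipy's distance.cosine, i.e. 1 - cosine similarity), one hypothetical adjustment, assuming the chained expression above is assigned to a variable named result instead of calling .show() directly:

# not part of the original answer: convert cosine similarity back to
# the cosine distance that scipy's distance.cosine produces
result = result.withColumn("similarity", 1 - psf.col("similarity"))
result.show()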
Source: https://stackoverflow.com/questions/64631890/convert-nested-dictionary-to-pyspark-dataframe