Question
Greetings, fellow programmers.
I have recently started with PySpark, coming from a pandas background. I need to compute the similarity of every user in a dataset against every other user. As I couldn't find a way to do this in PySpark, I resorted to using a Python dictionary to build the similarity data.
However, I have run out of ideas for converting a nested dictionary into a PySpark DataFrame. Could you please point me in a direction to achieve the desired result?
from pyspark.sql import SparkSession
from scipy.spatial import distance

spark = SparkSession.builder.getOrCreate()

traindf = spark.createDataFrame([
    ('u11', [1, 2, 3]),
    ('u12', [4, 5, 6]),
    ('u13', [7, 8, 9])
]).toDF("user", "rating")
traindf.show()
Output:
+----+---------+
|user| rating|
+----+---------+
| u11|[1, 2, 3]|
| u12|[4, 5, 6]|
| u13|[7, 8, 9]|
+----+---------+
I want to generate the similarity between each pair of users and put it into a PySpark DataFrame.
parent_dict = {}
for parent_row in traindf.collect():
    child_dict = {}
    for child_row in traindf.collect():
        # scipy's distance.cosine returns 1 - cosine similarity
        similarity = distance.cosine(parent_row['rating'], child_row['rating'])
        child_dict[child_row['user']] = similarity
    parent_dict[parent_row['user']] = child_dict
print(parent_dict)
Output:
{'u11': {'u11': 0.0, 'u12': 0.0253681538029239, 'u13': 0.0405880544333298},
'u12': {'u11': 0.0253681538029239, 'u12': 0.0, 'u13': 0.001809107314273195},
'u13': {'u11': 0.0405880544333298, 'u12': 0.001809107314273195, 'u13': 0.0}}
From this dictionary, I want to construct a PySpark DataFrame:
+-----+-----+--------------------+
|user1|user2| similarity|
+-----+-----+--------------------+
| u11| u11| 0.0|
| u11| u12| 0.0253681538029239|
| u11| u13| 0.0405880544333298|
| u12| u11| 0.0253681538029239|
| u12| u12| 0.0|
| u12| u13|0.001809107314273195|
| u13| u11| 0.0405880544333298|
| u13| u12|0.001809107314273195|
| u13| u13| 0.0|
+-----+-----+--------------------+
What I have tried so far is to convert the dict to a pandas DataFrame and then convert that into a PySpark DataFrame. However, I need to do this at a huge scale, so I am looking for a more Spark-ish way of doing it.
import pandas as pd

parent_user = []
child_user = []
child_similarity = []
for parent_row in traindf.collect():
    for child_row in traindf.collect():
        similarity = distance.cosine(parent_row['rating'], child_row['rating'])
        parent_user.append(parent_row['user'])
        child_user.append(child_row['user'])
        child_similarity.append(similarity)

my_dict = {'user1': parent_user, 'user2': child_user, 'similarity': child_similarity}

df = spark.createDataFrame(pd.DataFrame(my_dict))
df.show()
Output:
+-----+-----+--------------------+
|user1|user2| similarity|
+-----+-----+--------------------+
| u11| u11| 0.0|
| u11| u12| 0.0253681538029239|
| u11| u13| 0.0405880544333298|
| u12| u11| 0.0253681538029239|
| u12| u12| 0.0|
| u12| u13|0.001809107314273195|
| u13| u11| 0.0405880544333298|
| u13| u12|0.001809107314273195|
| u13| u13| 0.0|
+-----+-----+--------------------+
Answer 1:
Maybe you can do something like this:
import pandas as pd
from pyspark.sql import SQLContext

my_dic = {'u11': {'u11': 0.0, 'u12': 0.0253681538029239, 'u13': 0.0405880544333298},
          'u12': {'u11': 0.0253681538029239, 'u12': 0.0, 'u13': 0.001809107314273195},
          'u13': {'u11': 0.0405880544333298, 'u12': 0.001809107314273195, 'u13': 0.0}}

# unstack() flattens the nested dict into (user1, user2, value) rows
df = pd.DataFrame.from_dict(my_dic).unstack().to_frame().reset_index()
df.columns = ['user1', 'user2', 'similarity']

sqlCtx = SQLContext(sc)  # sc is the SparkContext
sqlCtx.createDataFrame(df).show()
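As a side note, SQLContext is deprecated in recent PySpark releases; the SparkSession from the question can create the DataFrame directly (a minimal sketch using the pandas df built above, not part of the original answer):

# equivalent without the deprecated SQLContext, reusing the question's SparkSession
spark.createDataFrame(df).show()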
Answer 2:
OK, now your question is clearer. I am assuming you are starting with a Spark DataFrame of (user, rating). What you want to do is cross-join this DF with itself; this creates a cross product with all possible pairs of users (and their ratings), including rows where a user is paired with itself (those can be filtered out later). You can then compute a new column that contains the similarity; a sketch of this idea follows.
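A minimal sketch of that idea, assuming the traindf from the question and reusing scipy's distance.cosine for the similarity column (the aliases left/right and the name cosine_udf are illustrative, not from the answer):

import pyspark.sql.functions as F
from pyspark.sql.types import DoubleType
from scipy.spatial import distance

# wrap scipy's cosine distance (1 - cosine similarity) in a UDF
cosine_udf = F.udf(lambda a, b: float(distance.cosine(a, b)), DoubleType())

pairs = (
    traindf.alias("left")
    .crossJoin(traindf.alias("right"))  # every (user, user) pair, self-pairs included
    .select(
        F.col("left.user").alias("user1"),
        F.col("right.user").alias("user2"),
        cosine_udf("left.rating", "right.rating").alias("similarity"),
    )
)
pairs.show()

Self-pairs can be dropped afterwards with a filter such as pairs.filter("user1 != user2").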
Answer 3:
import numpy as np
import pyspark.sql.functions as psf
from pyspark.sql.types import FloatType

def cos_sim(a, b):
    # cosine similarity (not distance): dot(a, b) / (|a| * |b|)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

dot_udf = psf.udf(lambda x, y: cos_sim(x, y), FloatType())

# data is the (user, rating) DataFrame, i.e. traindf from the question
data.alias("i").join(data.alias("j"), psf.col("i.user") != psf.col("j.user"))\
    .select(
        psf.col("i.user").alias("user1"),
        psf.col("j.user").alias("user2"),
        dot_udf("i.rating", "j.rating").alias("similarity"))\
    .sort("similarity")\
    .show()
The output is as desired (note that this UDF computes cosine similarity, whereas scipy's distance.cosine used in the question returns 1 - cosine similarity):
+-----+-----+----------+
|user1|user2|similarity|
+-----+-----+----------+
| u11| u12|0.70710677|
| u13| u11|0.70710677|
| u11| u13|0.70710677|
| u12| u11|0.70710677|
| u12| u13| 1.0|
| u13| u12| 1.0|
+-----+-----+----------+
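If the numbers should match the question's desired output (which used scipy's distance.cosine, i.e. 1 - cosine similarity), one hypothetical adjustment, assuming the chained expression above is assigned to a variable named result instead of calling .show() directly:

# not part of the original answer: convert cosine similarity back to
# the cosine distance that scipy's distance.cosine produces
result = result.withColumn("similarity", 1 - psf.col("similarity"))
result.show()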
Source: https://stackoverflow.com/questions/64631890/convert-nested-dictionary-to-pyspark-dataframe