Question
I am trying to create a unique synthetic key after identifying relationships between the original keys.
My DataFrame:
Key Value
K1 1
K2 2
K2 3
K1 3
K2 4
K1 5
K3 6
K4 6
K5 7
Expected Result:
Key Value New_Key
K1 1 NK1
K2 2 NK1
K2 3 NK1
K1 3 NK1
K2 4 NK1
K1 5 NK1
K3 6 NK2
K4 6 NK2
K5 7 NK3
I am looking for a solution in Python 3 or PySpark.
I tried it with this code:
# Import libraries
import networkx as nx
import pandas as pd
# Create the DataFrame
d1 = pd.DataFrame({'Key': ['K1', 'K2', 'K2', 'K1', 'K2', 'K1', 'K3', 'K4', 'K5'],
                   'Value': [1, 2, 3, 3, 4, 5, 6, 6, 7]})
# Create an empty graph
G = nx.Graph()
# Create a list of (Key, Value) edge tuples
e = list(d1.itertuples(index=False, name=None))
# Create a list of nodes/vertices
v = list(set(d1.Key).union(set(d1.Value)))
# Add nodes and edges to the graph
G.add_edges_from(e)
G.add_nodes_from(v)
# Get the list of connected components
c = list(nx.connected_components(G))
print(c)
Thanks in advance.
Answer 1:
What you are trying to solve is a graph problem known as connected components. All you have to do is treat your Keys and Values as vertices and run a connected components algorithm. The following shows a solution with pyspark and graphframes.
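import pyspark.sql.functions as F
from graphframes import GraphFrame
# connectedComponents() needs a checkpoint directory
# (sc and spark are the SparkContext and SparkSession, e.g. from the pyspark shell)
sc.setCheckpointDir('/tmp/graphframes')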
l = [('K1', 1),
     ('K2', 2),
     ('K2', 3),
     ('K1', 3),
     ('K2', 4),
     ('K1', 5),
     ('K3', 6),
     ('K4', 6),
     ('K5', 7)]
columns = ['Key', 'Value']
df=spark.createDataFrame(l, columns)
# Create a GraphFrame.
# An edge DataFrame requires a src and a dst column.
edges = df.withColumnRenamed('Key', 'src')\
          .withColumnRenamed('Value', 'dst')
# A vertex DataFrame requires an id column.
vertices = df.select('Key').union(df.select('Value')).withColumnRenamed('Key', 'id')
#this creates a graphframe...
g = GraphFrame(vertices, edges)
#which already has a function called connected components
cC = g.connectedComponents().withColumnRenamed('id', 'Key')
# Now join the connected-components DataFrame with the original DataFrame to add the new keys to it.
# distinct() removes duplicate rows from the join; they most likely arise because the
# vertices DataFrame is not deduplicated, so each key appears in it several times.
df = df.join(cC, 'Key', 'inner').distinct()
df.show()
Output:
+---+-----+------------+
|Key|Value|   component|
+---+-----+------------+
| K3|    6|335007449088|
| K1|    5|154618822656|
| K1|    1|154618822656|
| K1|    3|154618822656|
| K2|    2|154618822656|
| K2|    3|154618822656|
| K2|    4|154618822656|
| K4|    6|335007449088|
| K5|    7| 25769803776|
+---+-----+------------+
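Since the question also asks for plain Python, here is a minimal sketch of the same connected-components idea without Spark. It swaps graphframes for a small union-find (disjoint-set) structure, which is a different technique from the one used above. The function name assign_new_keys and the 'key'/'value' tags are hypothetical names chosen for illustration, and the NK1, NK2, ... labels simply follow the naming style used in the question.

```python
def assign_new_keys(pairs):
    """Attach one synthetic key per connected component to each (key, value) pair."""
    parent = {}

    def find(node):
        # Walk up to the root of the node's set, shortening the path on the way.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    def union(a, b):
        # Merge the sets containing a and b.
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Every row is an edge between a key vertex and a value vertex.
    # Assumption: keys and values are separate namespaces, so each node is
    # tagged to keep a key from ever colliding with an equal-looking value.
    for key, value in pairs:
        union(('key', key), ('value', value))

    # Number the components in order of first appearance: NK1, NK2, ...
    labels = {}
    result = []
    for key, value in pairs:
        root = find(('key', key))
        if root not in labels:
            labels[root] = 'NK%d' % (len(labels) + 1)
        result.append((key, value, labels[root]))
    return result


pairs = [('K1', 1), ('K2', 2), ('K2', 3), ('K1', 3), ('K2', 4),
         ('K1', 5), ('K3', 6), ('K4', 6), ('K5', 7)]
for row in assign_new_keys(pairs):
    print(row)
```

With the sample data, K1 and K2 end up in one group because they share the value 3, K3 and K4 share the value 6, and K5 is on its own, which matches the grouping in the component column above.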
Source: https://stackoverflow.com/questions/56704890/generate-synthetic-keys-to-map-many-to-many-relationship