Transforming pandas dataframe column to networkx graph with source and target

白昼怎懂夜的黑 提交于 2020-05-12 03:23:37

问题


I have a DataFrame in pandas with information about people location in time. It is about 300+ million rows.

Here is the sample where each Name is assigned to a unique index by group.by and sorted by Name and Year:

import pandas as pd
inp = [{'Name': 'John', 'Year':2018, 'Address':'Beverly hills'}, {'Name': 'John', 'Year':2018, 'Address':'Beverly hills'}, {'Name': 'John', 'Year':2019, 'Address':'Beverly hills'}, {'Name': 'John', 'Year':2019, 'Address':'Orange county'}, {'Name': 'John', 'Year':2019, 'Address':'New York'}, {'Name': 'Steve', 'Year':2018, 'Address':'Canada'}, {'Name': 'Steve', 'Year':2019, 'Address':'Canada'}, {'Name': 'Steve', 'Year':2019, 'Address':'Canada'}, {'Name': 'Steve', 'Year':2020, 'Address':'California'}, {'Name': 'Steve', 'Year':2020, 'Address':'Canada'}, {'Name': 'John', 'Year':2020, 'Address':'Canada'}, {'Name': 'John', 'Year':2021, 'Address':'Canada'}, {'Name': 'John', 'Year':2021, 'Address':'Beverly hills'}, {'Name': 'Steve', 'Year':2021, 'Address':'California'}, {'Name': 'Steve', 'Year':2022, 'Address':'California'}, {'Name': 'Steve', 'Year':2018, 'Address':'NewYork'}, {'Name': 'Steve', 'Year':2018, 'Address':'California'}, {'Name': 'Steve', 'Year':2022, 'Address':'NewYork'}]
df = pd.DataFrame(inp)
df['Author_Grouped_Index'] = df.groupby(['Name']).ngroup()
df.sort_values(['Name', 'Year'], ascending=[False, True])

Output:

+-------+-------+------+---------------+----------------------+
| Index | Name  | Year | Address       | Name_Grouped_Index   |
+-------+-------+------+---------------+----------------------+
| 5     | Steve | 2018 | Canada        | 1                    |
+-------+-------+------+---------------+----------------------+
| 15    | Steve | 2018 | NewYork       | 1                    |
+-------+-------+------+---------------+----------------------+
| 16    | Steve | 2018 | California    | 1                    |
+-------+-------+------+---------------+----------------------+
| 6     | Steve | 2019 | Canada        | 1                    |
+-------+-------+------+---------------+----------------------+
| 7     | Steve | 2019 | Canada        | 1                    |
+-------+-------+------+---------------+----------------------+
| 8     | Steve | 2020 | California    | 1                    |
+-------+-------+------+---------------+----------------------+
| 9     | Steve | 2020 | Canada        | 1                    |
+-------+-------+------+---------------+----------------------+
| 13    | Steve | 2021 | California    | 1                    |
+-------+-------+------+---------------+----------------------+
| 14    | Steve | 2022 | California    | 1                    |
+-------+-------+------+---------------+----------------------+
| 17    | Steve | 2022 | NewYork       | 1                    |
+-------+-------+------+---------------+----------------------+
| 0     | John  | 2018 | Beverly hills | 0                    |
+-------+-------+------+---------------+----------------------+
| 1     | John  | 2018 | Beverly hills | 0                    |
+-------+-------+------+---------------+----------------------+
| 2     | John  | 2019 | Beverly hills | 0                    |
+-------+-------+------+---------------+----------------------+
| 3     | John  | 2019 | Orange county | 0                    |
+-------+-------+------+---------------+----------------------+
| 4     | John  | 2019 | New York      | 0                    |
+-------+-------+------+---------------+----------------------+
| 10    | John  | 2020 | Canada        | 0                    |
+-------+-------+------+---------------+----------------------+
| 11    | John  | 2021 | Canada        | 0                    |
+-------+-------+------+---------------+----------------------+
| 12    | John  | 2021 | Beverly hills | 0                    |
+-------+-------+------+---------------+----------------------+

I want to get the network graph matrix (adjacency matrix) where to see the total of changes between Addresses. In other words, for example, how many times people moved from “Canada” to “California” in 2018.

Ideal Outputs:

1) A direct graph from the Address column. Technically converting the Address column into two columns "Source" & "Target" where the "Target" value is the "Source" for the next row. Preferably counting the pairs in another column "Weight" instead of pairs being repeated.

+------------+------------+------+--------+
| Source     | Target     | Year | Weight |
+------------+------------+------+--------+
| Canada     | NewYork    | 2018 |        |
+------------+------------+------+--------+
| NewYork    | California | 2018 |        |
+------------+------------+------+--------+
| California | Canada     | 2019 |        |
+------------+------------+------+--------+
| Canada     | Canada     | 2019 |        |
+------------+------------+------+--------+
| Canada     | California | 2020 |        |
+------------+------------+------+--------+
| California | Canada     | 2020 |        |
+------------+------------+------+--------+
| Canada     | California | 2021 |        |
+------------+------------+------+--------+
| California | California | 2022 |        |
+------------+------------+------+--------+
| California | NewYork    | 2022 |        |
+------------+------------+------+--------+

OR

2) A matrix to illustrate the total changes between addresses.

+---------------+--------+---------+------------+---------------+---------------+
| From \ To     | Canada | NewYork | California | Beverly hills | Orange county |
+---------------+--------+---------+------------+---------------+---------------+
| Canada        | 2      | 2       | 2          | 2             | 0             |
+---------------+--------+---------+------------+---------------+---------------+
| NewYork       | 1      | 0       | 1          | 0             | 0             |
+---------------+--------+---------+------------+---------------+---------------+
| California    | 2      | 1       | 1          | 0             | 0             |
+---------------+--------+---------+------------+---------------+---------------+
| Beverly hills | 0      | 0       | 0          | 2             | 1             |
+---------------+--------+---------+------------+---------------+---------------+
| Orange county | 0      | 1       | 0          | 0             | 0             |
+---------------+--------+---------+------------+---------------+---------------+

回答1:


This is not the prettiest code but at least you can follow each step. I've gone for the second option because you can easily make your graph from this connections matrix. Do you need help with making the networkx graph? The rows and columns of the matrix are : ['Beverly hills', 'Orange county', 'New York', 'Canada', 'California', 'NewYork'] You've spelled newyork differently for each person so it comes up twice.

import pandas as pd
inp = [{'Name': 'John', 'Year':2018, 'Address':'Beverly hills'}, {'Name': 'John', 'Year':2018, 'Address':'Beverly hills'}, {'Name': 'John', 'Year':2019, 'Address':'Beverly hills'}, {'Name': 'John', 'Year':2019, 'Address':'Orange county'}, {'Name': 'John', 'Year':2019, 'Address':'New York'}, {'Name': 'Steve', 'Year':2018, 'Address':'Canada'}, {'Name': 'Steve', 'Year':2019, 'Address':'Canada'}, {'Name': 'Steve', 'Year':2019, 'Address':'Canada'}, {'Name': 'Steve', 'Year':2020, 'Address':'California'}, {'Name': 'Steve', 'Year':2020, 'Address':'Canada'}, {'Name': 'John', 'Year':2020, 'Address':'Canada'}, {'Name': 'John', 'Year':2021, 'Address':'Canada'}, {'Name': 'John', 'Year':2021, 'Address':'Beverly hills'}, {'Name': 'Steve', 'Year':2021, 'Address':'California'}, {'Name': 'Steve', 'Year':2022, 'Address':'California'}, {'Name': 'Steve', 'Year':2018, 'Address':'NewYork'}, {'Name': 'Steve', 'Year':2018, 'Address':'California'}, {'Name': 'Steve', 'Year':2022, 'Address':'NewYork'}]
df = pd.DataFrame(inp)
df['Author_Grouped_Index'] = df.groupby(['Name']).ngroup()
df.sort_values(['Name', 'Year'], ascending=[False, True])

print (df)
dictionary_ = {} # where each person went
places = [] # all of the places
for index, row in df.iterrows():
    if row['Author_Grouped_Index'] not in dictionary_:
        dictionary_[row['Author_Grouped_Index']] = []
        dictionary_[row['Author_Grouped_Index']].append(row["Address"])
    else:
        dictionary_[row['Author_Grouped_Index']].append(row["Address"])
    if row["Address"] not in places:
        places.append(row["Address"])


print (dictionary_)

new_dictionary = {} #number of times each place visited
for key, value in dictionary_.items():
    for x in range(len(value)-1):
        move = value[x] + "-" + value[x+1]
        if not move in new_dictionary:
            new_dictionary[move] = 1
        else:
            new_dictionary[move] += 1

print (new_dictionary)
print (places)
import numpy as np
array = np.zeros((len(places),len(places)), dtype=int)
for x, place in enumerate(places):
    for y, place_2 in enumerate(places):

        move_2 = (place + "-" + place_2)
        try:
            array[x,y] = (new_dictionary[move_2])
        except:
            array[x,y] = 0

print (array)


来源:https://stackoverflow.com/questions/61307877/transforming-pandas-dataframe-column-to-networkx-graph-with-source-and-target

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!