Extract edges and communities from a list of nodes

Submitted by 醉酒当歌 on 2019-12-01 17:02:55

Use Pandas to get the data into a pairwise node listing, where each row represents an edge, based on your edge criteria. Then migrate into a networkx object for graph analysis.

The criteria for two nodes sharing an edge include:

  1. Same location. Assuming this means same gps1 AND gps2.
  2. "Near same start and end time" This is a little ambiguous. For the purposes of this answer I've reduced this criterion to "start time in the same 5-second interval". It shouldn't be too hard to extend the groupby approach I've taken here if you want to apply additional temporal conditions on edges.
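
To make the 5-second criterion concrete, here is how pd.Grouper bins timestamps (toy values, not from the question; the closed and label settings match the groupby used below):

```python
import pandas as pd

# Toy timestamps a few seconds apart; rows whose start falls in the
# same 5-second window end up in the same group.
toy = pd.DataFrame({
    "ID": list("abcde"),
    "start": pd.to_datetime([1, 3, 4, 13, 25], unit="s",
                            origin="2004-01-05"),
})

for window, grp in toy.groupby(pd.Grouper(key="start", freq="5s",
                                          closed="right", label="right")):
    if len(grp):
        print(window, grp.ID.tolist())
```

Here a, b, and c all land in the bin labeled 00:00:05, while d and e fall in later windows on their own.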

Since we want to manipulate data based on timestamps, convert start and end to datetime dtype:

df.start = pd.to_datetime(df.start, unit="s")
df.end = pd.to_datetime(df.end, unit="s")

df.start.describe()
count                      35
unique                     11
top       2004-01-05 00:00:13
freq                        8
first     2004-01-05 00:00:01
last      2004-01-05 00:00:26
Name: start, dtype: object

df.head()
             ID               start                 end    gps1    gps2
0  00022d9064bc 2004-01-05 00:00:01 2004-01-05 00:00:03  819251  440006
1  00022d9064bc 2004-01-05 00:00:03 2004-01-05 00:00:10  819213  439954
2  00904b4557d3 2004-01-05 00:00:03 2004-01-05 00:18:40  817526  439458
3  00022de73863 2004-01-05 00:00:04 2004-01-05 01:16:50  817558  439525
4  00904b14b494 2004-01-05 00:00:04 2004-01-05 00:30:25  817558  439525
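
If you don't have the original data handy, the five rows shown above are enough to rebuild a toy df to follow along with (start/end are constructed directly as datetimes here, so the to_datetime conversion above has already been applied):

```python
import pandas as pd

# Rebuild the five sample rows from the head() output above.
df = pd.DataFrame({
    "ID": ["00022d9064bc", "00022d9064bc", "00904b4557d3",
           "00022de73863", "00904b14b494"],
    "start": pd.to_datetime(["2004-01-05 00:00:01", "2004-01-05 00:00:03",
                             "2004-01-05 00:00:03", "2004-01-05 00:00:04",
                             "2004-01-05 00:00:04"]),
    "end": pd.to_datetime(["2004-01-05 00:00:03", "2004-01-05 00:00:10",
                           "2004-01-05 00:18:40", "2004-01-05 01:16:50",
                           "2004-01-05 00:30:25"]),
    "gps1": [819251, 819213, 817526, 817558, 817558],
    "gps2": [440006, 439954, 439458, 439525, 439525],
})
```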

The sample observations happen within a few seconds of each other, so we'll set the grouping frequency to be only a few seconds:

near = "5s" 

Now groupby location and start time to find connected nodes:

edges = (df.groupby(["gps1",
                     "gps2",
                     pd.Grouper(key="start", 
                                freq=near, 
                                closed="right", 
                                label="right")], 
                   as_index=False)
           .agg({"ID":','.join,
                 "start":"min",
                 "end":"max"})
            .reset_index()
            .rename(columns={"index":"edge",
                             "start":"start_min", 
                             "end":"end_max"})
        )

edges.ID = edges.ID.str.split(",")

edges.head()

   edge    gps1    gps2                                                 ID  \
0     0  817526  439458                                     [00904b4557d3]   
1     1  817558  439525  [00022de73863, 00904b14b494, 00904b14b494, 009...   
2     2  817558  439525         [00022de73863, 00904b14b494, 00904b312d9e]   
3     3  817721  439564  [00022d176cf3, 000c30d8d2e8, 00904b243bc4, 009...   
4     4  817735  439757                       [003065d2d8b6, 00904b0c7856]   

            start_min             end_max  
0 2004-01-05 00:00:03 2004-01-05 00:18:40  
1 2004-01-05 00:00:04 2004-01-05 01:16:50  
2 2004-01-05 00:00:25 2004-01-05 00:01:19  
3 2004-01-05 00:00:13 2004-01-05 00:02:42  
4 2004-01-05 00:00:17 2004-01-05 01:52:40 

Each row now represents a unique edge category. ID is a list of the nodes that all share that edge. It's a bit tricky to get this list into a new structure of node pairs; I've resorted to some old-fashioned nested for loops. There's likely some Pandas-fu that can improve efficiency here:

Note: In the case of a singleton node, I've assigned a None value to its pair. If you don't want to track singletons, just ignore the if not len(combos): ... logic.

from itertools import combinations

pairs = []
for e in edges.edge.values:
    nodes = edges.loc[edges.edge==e, "ID"].values[0]
    attrs = edges.loc[edges.edge==e, ["gps1","gps2","start_min","end_max"]]
    combos = list(combinations(nodes, 2))
    if not len(combos):
        # singleton node: pair it with None
        pair = [e, nodes[0], None]
        pair.extend(attrs.values[0])
        pairs.append(pair)
    else:
        for combo in combos:
            pair = [e, combo[0], combo[1]]
            pair.extend(attrs.values[0])
            pairs.append(pair)
cols = ["edge","nodeA","nodeB","gps1","gps2","start_min","end_max"]
pairs_df = pd.DataFrame(pairs, columns=cols)
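
On the efficiency caveat: the same pairs can be produced in a single pass with itertuples, which avoids the repeated .loc lookups (a sketch; the small edges frame here is a stand-in for the one built above):

```python
from itertools import combinations
import pandas as pd

# Tiny stand-in for the edges frame built above: one singleton edge
# and one edge shared by three nodes.
edges = pd.DataFrame({
    "edge": [0, 1],
    "ID": [["00904b4557d3"],
           ["00022de73863", "00904b14b494", "00904b312d9e"]],
    "gps1": [817526, 817558],
    "gps2": [439458, 439525],
    "start_min": pd.to_datetime(["2004-01-05 00:00:03",
                                 "2004-01-05 00:00:04"]),
    "end_max": pd.to_datetime(["2004-01-05 00:18:40",
                               "2004-01-05 01:16:50"]),
})

pairs = []
for row in edges.itertuples(index=False):
    # An empty combinations list falls back to pairing the lone node
    # with None, matching the singleton handling above.
    for a, b in list(combinations(row.ID, 2)) or [(row.ID[0], None)]:
        pairs.append([row.edge, a, b, row.gps1, row.gps2,
                      row.start_min, row.end_max])

pairs_df = pd.DataFrame(pairs, columns=["edge", "nodeA", "nodeB", "gps1",
                                        "gps2", "start_min", "end_max"])
```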

pairs_df.head()

   edge         nodeA         nodeB    gps1    gps2           start_min  \
0     0  00904b4557d3          None  817526  439458 2004-01-05 00:00:03   
1     1  00022de73863  00904b14b494  817558  439525 2004-01-05 00:00:04   
2     1  00022de73863  00904b14b494  817558  439525 2004-01-05 00:00:04   
3     1  00022de73863  00904b14b494  817558  439525 2004-01-05 00:00:04   
4     1  00904b14b494  00904b14b494  817558  439525 2004-01-05 00:00:04   

              end_max  
0 2004-01-05 00:18:40  
1 2004-01-05 01:16:50  
2 2004-01-05 01:16:50  
3 2004-01-05 01:16:50  
4 2004-01-05 01:16:50      

Now the data can be fit to a networkx object:

import networkx as nx

# networkx < 2.0; in networkx >= 2.0 this function was renamed:
# g = nx.from_pandas_edgelist(pairs_df, "nodeA", "nodeB", edge_attr=True)
g = nx.from_pandas_dataframe(pairs_df, "nodeA", "nodeB", edge_attr=True)

# access edge attributes by node pairing:
test_A = "00022de73863"
test_B = "00904b14b494"
g[test_A][test_B]["start_min"]
# output:
Timestamp('2004-01-05 00:00:25')
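
One caveat on that lookup: start_min comes back as 00:00:25 (edge 2) rather than 00:00:04 (edge 1), because a simple Graph keeps only one edge per node pair, so later rows overwrite earlier ones. If you need to keep the parallel edges, a MultiGraph preserves them; a sketch with a two-row stand-in for pairs_df:

```python
import pandas as pd
import networkx as nx

# Two rows for the same node pair, as with edges 1 and 2 above.
pairs_df = pd.DataFrame({
    "nodeA": ["00022de73863", "00022de73863"],
    "nodeB": ["00904b14b494", "00904b14b494"],
    "edge": [1, 2],
    "start_min": pd.to_datetime(["2004-01-05 00:00:04",
                                 "2004-01-05 00:00:25"]),
})

mg = nx.MultiGraph()
for row in pairs_df.itertuples(index=False):
    mg.add_edge(row.nodeA, row.nodeB, edge=row.edge, start_min=row.start_min)

# Both parallel edges survive, keyed 0 and 1:
mg.get_edge_data("00022de73863", "00904b14b494")
```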

For community detection, there are several options. Consider the networkx community algorithms, as well as the community module, which builds off of native networkx functionality.

I read your question as mainly concerned with manipulating your data into a format suitable for network analysis. As this answer is lengthy enough already, I'll leave it to you to pursue community detection strategies - several methods can be used out-of-the-box with the modules I've linked to here.
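
That said, one out-of-the-box starting point is the greedy modularity method that ships with networkx itself (>= 2.0). Drop the None singleton rows first so every row is a real edge; the pairs_df here is a toy stand-in:

```python
import pandas as pd
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Stand-in for pairs_df: two tight groups joined by a single edge.
pairs_df = pd.DataFrame(
    [("a", "b"), ("b", "c"), ("a", "c"),
     ("x", "y"), ("y", "z"), ("x", "z"),
     ("c", "x")],
    columns=["nodeA", "nodeB"])

# Singleton rows carry nodeB=None; drop them so every row is a real edge.
real_edges = pairs_df.dropna(subset=["nodeB"])
g = nx.from_pandas_edgelist(real_edges, "nodeA", "nodeB")

communities = greedy_modularity_communities(g)
```

On this toy graph the two tight groups come back as separate communities; on your data you'd build real_edges from the pairs_df constructed earlier.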
