Gremlin : AWS Neptune - Get all Leaf Nodes for each Node in the Graph as CSV

余生颓废 提交于 2021-01-28 07:23:10

问题


I have a simple graph that has Nodes which represent duplicate record id in the below form

Duplicate Id, Original Id
A,B
B,C
C,D
X,Y
Y,Z

The directed graph looks like A -> B ->C ->D and I want CSV result that looks like below that has each Node with ultimate leaf node with no more outgoing edges

A,D
B,D
C,D
X,Z
Y,Z

The above is a simplistic scenario to explain the problem however actual data will have more complex scenarios like below where I have 24 Nodes from A to X with each node connect to other 23 having 24C2=276 directed edges and for each of the 24 nodes I require the ultimate node that has no more outgoing edges.

A,B
A,C
A,D
A,E
A,F
A,G
A,H
A,I
A,J
A,K
A,L
A,M
A,N
A,O
A,P
A,Q
A,R
A,S
A,T
A,U
A,V
A,W
A,X
B,C
B,D
B,E
B,F
B,G
B,H
B,I
B,J
B,K
B,L
B,M
B,N
B,O
B,P
B,Q
B,R
B,S
B,T
B,U
B,V
B,W
B,X
C,D
C,E
C,F
C,G
C,H
C,I
C,J
C,K
C,L
C,M
C,N
C,O
C,P
C,Q
C,R
C,S
C,T
C,U
C,V
C,W
C,X
D,E
D,F
D,G
D,H
D,I
D,J
D,K
D,L
D,M
D,N
D,O
D,P
D,Q
D,R
D,S
D,T
D,U
D,V
D,W
D,X
E,F
E,G
E,H
E,I
E,J
E,K
E,L
E,M
E,N
E,O
E,P
E,Q
E,R
E,S
E,T
E,U
E,V
E,W
E,X
F,G
F,H
F,I
F,J
F,K
F,L
F,M
F,N
F,O
F,P
F,Q
F,R
F,S
F,T
F,U
F,V
F,W
F,X
G,H
G,I
G,J
G,K
G,L
G,M
G,N
G,O
G,P
G,Q
G,R
G,S
G,T
G,U
G,V
G,W
G,X
H,I
H,J
H,K
H,L
H,M
H,N
H,O
H,P
H,Q
H,R
H,S
H,T
H,U
H,V
H,W
H,X
I,J
I,K
I,L
I,M
I,N
I,O
I,P
I,Q
I,R
I,S
I,T
I,U
I,V
I,W
I,X
J,K
J,L
J,M
J,N
J,O
J,P
J,Q
J,R
J,S
J,T
J,U
J,V
J,W
J,X
K,L
K,M
K,N
K,O
K,P
K,Q
K,R
K,S
K,T
K,U
K,V
K,W
K,X
L,M
L,N
L,O
L,P
L,Q
L,R
L,S
L,T
L,U
L,V
L,W
L,X
M,N
M,O
M,P
M,Q
M,R
M,S
M,T
M,U
M,V
M,W
M,X
N,O
N,P
N,Q
N,R
N,S
N,T
N,U
N,V
N,W
N,X
O,P
O,Q
O,R
O,S
O,T
O,U
O,V
O,W
O,X
P,Q
P,R
P,S
P,T
P,U
P,V
P,W
P,X
Q,R
Q,S
Q,T
Q,U
Q,V
Q,W
Q,X
R,S
R,T
R,U
R,V
R,W
R,X
S,T
S,U
S,V
S,W
S,X
T,U
T,V
T,W
T,X
U,V
U,W
U,X
V,W
V,X
W,X

Here is a graphical visualization using Neo4j -

For the above case each Node A to W will have X as the ultimate Leaf Node.

There will also be Cyclic loops which I need to avoid in the overall solution. It may be too much to have everything in one step however will appreciate guidance.

Update : 2020-10-15 Traversal Optimization need to optimize execution in finding Path from starting Node to Leaf Node

For the Data Scenario below, The result for Vertex A and B should be

A,G
A,H
B,G
B,H

The shortest path from A to G is A->C->E->G ; is it possible to suppress any further traversals from A to G when any 1 shortest path to a leaf node is found ? else it will lead to unwanted execution especially for larger clusters of connected Nodes. This steps needs to be repeated again from A to H whose path will be A->C->F->H and then prevent any further attempts to find paths between A and H.

Thanks


回答1:


Based on what you have provided, the following graph can be built.

g.addV('A').as('a').
  addV('B').as('b').
  addV('C').as('c').
  addV('D').as('d').
  addV('X').as('x').
  addV('Y').as('y').
  addV('Z').as('z').
  addE('dup').from('a').to('b').
  addE('dup').from('b').to('c').
  addE('dup').from('c').to('d').
  addE('dup').from('x').to('y').
  addE('dup').from('y').to('z').
  iterate()  

and the following query used to find the results

gremlin> g.V().
           repeat(out().simplePath()).
           until(__.not(out())).
           path().
             by(label).
           local(
             unfold().
             union(
               limit(1),
               tail()).
             fold())


==>[A,D]
==>[B,D]
==>[C,D]
==>[X,Z]
==>[Y,Z] 

UPDATED Oct 11th 2020

Using the larger data set you provided, I tweaked the query a bit so that for each starting vertex only one path to a leaf is found. This runs very quickly. Without doing this, starting at say 'A', there are literally millions of unique paths that end up at 'X' which is why the previous query becomes so complex.

gremlin> g.V().
......1>   local(
......2>     repeat(out().simplePath()).
......3>     until(__.not(out())).
......4>     path().
......5>       by(label).
......6>       limit(1)).
......7>   local(
......8>     unfold().
......9>     union(
.....10>       limit(1),
.....11>       tail()).
.....12>     fold())

==>[A,X]
==>[B,X]
==>[C,X]
==>[D,X]
==>[E,X]
==>[F,X]
==>[G,X]
==>[H,X]
==>[I,X]
==>[J,X]
==>[K,X]
==>[L,X]
==>[M,X]
==>[N,X]
==>[O,X]
==>[P,X]
==>[Q,X]
==>[R,X]
==>[S,X]
==>[T,X]
==>[U,X]
==>[V,X]
==>[W,X]

Purely for interest the following query shows the highly connected fan out of the graph.

gremlin> g.V().group().by(label).by(local(out().label().order().fold())).unfold()

==>A=[B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X]
==>B=[C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X]
==>C=[D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X]
==>D=[E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X]
==>E=[F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X]
==>F=[G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X]
==>G=[H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X]
==>H=[I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X]
==>I=[J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X]
==>J=[K, L, M, N, O, P, Q, R, S, T, U, V, W, X]
==>K=[L, M, N, O, P, Q, R, S, T, U, V, W, X]
==>L=[M, N, O, P, Q, R, S, T, U, V, W, X]
==>M=[N, O, P, Q, R, S, T, U, V, W, X]
==>N=[O, P, Q, R, S, T, U, V, W, X]
==>O=[P, Q, R, S, T, U, V, W, X]
==>P=[Q, R, S, T, U, V, W, X]
==>Q=[R, S, T, U, V, W, X]
==>R=[S, T, U, V, W, X]
==>S=[T, U, V, W, X]
==>T=[U, V, W, X]
==>U=[V, W, X]
==>V=[W, X]
==>W=[X]
==>X=[]

and the counts (multiplying out those numbers gives a large number) which explains why finding all paths from 'A' is an expensive query. Note that the simplePath step helps us by making sure we do not follow any cycles. The number of paths from any vertex to 'X' in this data set ends up being 2^(C-1) where C is the number in the list below for a given start.

gremlin> g.V().group().by(label).by(local(out().count())).unfold()

==>A=23
==>B=22
==>C=21
==>D=20
==>E=19
==>F=18
==>G=17
==>H=16
==>I=15
==>J=14
==>K=13
==>L=12
==>M=11
==>N=10
==>O=9
==>P=8
==>Q=7
==>R=6
==>S=5
==>T=4
==>U=3
==>V=2
==>W=1
==>X=0


来源:https://stackoverflow.com/questions/63999148/gremlin-aws-neptune-get-all-leaf-nodes-for-each-node-in-the-graph-as-csv

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!