I\'m interested in calculating the Jaccard similarity metric for all pairs of vertices in a graph that are not directly connected. The Jaccard metric is defined as the norm of
Let's do it step by step:
Find pairs of vertices and also collect their respective neighbors:
g.V().match(
__.as('v1').out().dedup().fold().as('v1n'),
__.as('v1').V().as('v2'),
__.as('v2').out().dedup().fold().as('v2n')).
where('v1', neq('v2'))
Make sure that v1 is not a neighbor of v2 and vice versa:
g.V().match(
__.as('v1').out().dedup().fold().as('v1n'),
__.as('v1').V().as('v2'),
__.as('v2').out().dedup().fold().as('v2n')).
where('v1', neq('v2').and(without('v2n'))).
where('v2', without('v1n'))
Next, compute the number of intersecting neighbors and the total number of neighbors:
g.V().match(
__.as('v1').out().dedup().fold().as('v1n'),
__.as('v1').V().as('v2'),
__.as('v2').out().dedup().fold().as('v2n')).
where('v1', neq('v2').and(without('v2n'))).
where('v2', without('v1n')).as('m').
project('v1','v2','i','u').
by(select('v1')).
by(select('v2')).
by(select('v1n').as('n').
select('m').
select('v2n').unfold().
where(within('n')).
count()).
by(union(select('v1n'),
select('v2n')).unfold().
dedup().count())
And finally, compute the Jaccard similarity by dividing i by u (also make sure that vertices without neighbors get filtered out to prevent divisions by 0):
g.V().match(
__.as('v1').out().dedup().fold().as('v1n'),
__.as('v1').V().as('v2'),
__.as('v2').out().dedup().fold().as('v2n')).
where('v1', neq('v2').and(without('v2n'))).
where('v2', without('v1n')).as('m').
project('v1','v2','i','u').
by(select('v1')).
by(select('v2')).
by(select('v1n').as('n').
select('m').
select('v2n').unfold().
where(within('n')).
count()).
by(union(select('v1n'),
select('v2n')).unfold().
dedup().count()).
filter(select('u').is(gt(0))).
project('v1','v2','j').
by(select('v1')).
by(select('v2')).
by(math('i/u'))
One last thing: Since comparing vertex v1 and v2 is the same as comparing v2 and v1, the query only needs to consider one case. One way to do that is by making sure that v1's id is smaller than v2's id:
g.V().match(
__.as('v1').out().dedup().fold().as('v1n'),
__.as('v1').V().as('v2'),
__.as('v2').out().dedup().fold().as('v2n')).
where('v1', lt('v2')).
by(id).
where('v1', without('v2n')).
where('v2', without('v1n')).as('m').
project('v1','v2','i','u').
by(select('v1')).
by(select('v2')).
by(select('v1n').as('n').
select('m').
select('v2n').unfold().
where(within('n')).
count()).
by(union(select('v1n'),
select('v2n')).unfold().
dedup().count()).
filter(select('u').is(gt(0))).
project('v1','v2','j').
by(select('v1')).
by(select('v2')).
by(math('i/u'))
Executing this traversal over the modern toy graph yields the following result:
gremlin> g = TinkerFactory.createModern().traversal()
==>graphtraversalsource[tinkergraph[vertices:6 edges:6], standard]
gremlin> g.V().match(
......1> __.as('v1').out().dedup().fold().as('v1n'),
......2> __.as('v1').V().as('v2'),
......3> __.as('v2').out().dedup().fold().as('v2n')).
......4> where('v1', lt('v2')).
......5> by(id).
......6> where('v1', without('v2n')).
......7> where('v2', without('v1n')).as('m').
......8> project('v1','v2','i','u').
......9> by(select('v1')).
.....10> by(select('v2')).
.....11> by(select('v1n').as('n').
.....12> select('m').
.....13> select('v2n').unfold().
.....14> where(within('n')).
.....15> count()).
.....16> by(union(select('v1n'),
.....17> select('v2n')).unfold().
.....18> dedup().count()).
.....19> filter(select('u').is(gt(0))).
.....20> project('v1','v2','j').
.....21> by(select('v1')).
.....22> by(select('v2')).
.....23> by(math('i/u'))
==>[v1:v[1],v2:v[5],j:0.0]
==>[v1:v[1],v2:v[6],j:0.3333333333333333]
==>[v1:v[2],v2:v[4],j:0.0]
==>[v1:v[2],v2:v[6],j:0.0]
==>[v1:v[4],v2:v[6],j:0.5]
==>[v1:v[5],v2:v[6],j:0.0]