Gremlin syntax to calculate the Jaccard similarity metric

Submitted by 允我心安 on 2019-12-02 11:48:50

Let's do it step by step:

Find pairs of vertices and also collect their respective neighbors:

    g.V().match(
          __.as('v1').out().dedup().fold().as('v1n'),
          __.as('v1').V().as('v2'),
          __.as('v2').out().dedup().fold().as('v2n')).
        where('v1', neq('v2'))

Make sure that v1 is not a neighbor of v2 and vice versa:

    g.V().match(
          __.as('v1').out().dedup().fold().as('v1n'),
          __.as('v1').V().as('v2'),
          __.as('v2').out().dedup().fold().as('v2n')).
        where('v1', neq('v2').and(without('v2n'))).
        where('v2', without('v1n'))

Next, compute the number of shared neighbors (i, the size of the intersection) and the number of distinct neighbors across both vertices (u, the size of the union):

    g.V().match(
          __.as('v1').out().dedup().fold().as('v1n'),
          __.as('v1').V().as('v2'),
          __.as('v2').out().dedup().fold().as('v2n')).
        where('v1', neq('v2').and(without('v2n'))).
        where('v2', without('v1n')).as('m').
      project('v1','v2','i','u').
        by(select('v1')).
        by(select('v2')).
        by(select('v1n').as('n').
           select('m').
           select('v2n').unfold().
             where(within('n')).
           count()).
        by(union(select('v1n'),
                 select('v2n')).unfold().
           dedup().count())

Finally, compute the Jaccard similarity by dividing i by u. Pairs in which neither vertex has any neighbors (u = 0) are filtered out first to prevent division by zero:

    g.V().match(
          __.as('v1').out().dedup().fold().as('v1n'),
          __.as('v1').V().as('v2'),
          __.as('v2').out().dedup().fold().as('v2n')).
        where('v1', neq('v2').and(without('v2n'))).
        where('v2', without('v1n')).as('m').
      project('v1','v2','i','u').
        by(select('v1')).
        by(select('v2')).
        by(select('v1n').as('n').
           select('m').
           select('v2n').unfold().
             where(within('n')).
           count()).
        by(union(select('v1n'),
                 select('v2n')).unfold().
           dedup().count()).
      filter(select('u').is(gt(0))).
      project('v1','v2','j').
        by(select('v1')).
        by(select('v2')).
        by(math('i/u'))
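For reference, the quantity computed by math('i/u') can be written out as follows. The notation N(v) for the set of outgoing neighbors of v is an assumption introduced here for illustration; the worked numbers use vertices v[4] and v[6] of TinkerPop's modern toy graph.

```latex
% N(v): set of vertices reachable from v via out()  (notation assumed here)
J(v_1, v_2) \;=\; \frac{|N(v_1) \cap N(v_2)|}{|N(v_1) \cup N(v_2)|} \;=\; \frac{i}{u},
\qquad u > 0

% Worked example on the modern toy graph:
% N(v_4) = \{v_3, v_5\}, \quad N(v_6) = \{v_3\}
% i = |\{v_3\}| = 1, \quad u = |\{v_3, v_5\}| = 2
J(v_4, v_6) \;=\; \frac{1}{2} \;=\; 0.5
```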

One last thing: Since comparing vertex v1 and v2 is the same as comparing v2 and v1, the query only needs to consider one case. One way to do that is by making sure that v1's id is smaller than v2's id:

    g.V().match(
          __.as('v1').out().dedup().fold().as('v1n'),
          __.as('v1').V().as('v2'),
          __.as('v2').out().dedup().fold().as('v2n')).
        where('v1', lt('v2')).
          by(id).
        where('v1', without('v2n')).
        where('v2', without('v1n')).as('m').
      project('v1','v2','i','u').
        by(select('v1')).
        by(select('v2')).
        by(select('v1n').as('n').
           select('m').
           select('v2n').unfold().
             where(within('n')).
           count()).
        by(union(select('v1n'),
                 select('v2n')).unfold().
           dedup().count()).
      filter(select('u').is(gt(0))).
      project('v1','v2','j').
        by(select('v1')).
        by(select('v2')).
        by(math('i/u'))
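The effect of the id ordering can be illustrated outside of Gremlin. The following is a minimal Python sketch, not part of the original answer: the OUT dictionary is a hand-written copy of the outgoing adjacency of TinkerPop's modern toy graph, and jaccard is a hypothetical helper. It shows that the metric is symmetric and that keeping only pairs with v1 < v2 halves the number of comparisons.

```python
from itertools import combinations, permutations

# Outgoing neighbors per vertex id, copied by hand from the modern toy graph.
OUT = {1: {2, 3, 4}, 2: set(), 3: set(), 4: {3, 5}, 5: set(), 6: {3}}


def jaccard(a, b):
    """Jaccard similarity of two sets, or None when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else None


# neq('v2') keeps every ordered pair; lt('v2').by(id) keeps one per unordered pair.
ordered = list(permutations(sorted(OUT), 2))
unordered = list(combinations(sorted(OUT), 2))

# Swapping the arguments never changes the result, so half the work is redundant.
symmetric = all(jaccard(OUT[a], OUT[b]) == jaccard(OUT[b], OUT[a]) for a, b in ordered)

print(len(ordered), len(unordered), symmetric)
```

With six vertices this is 30 ordered pairs versus 15 unordered ones, before the neighbor and empty-union filters are applied.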

Executing this traversal over the modern toy graph yields the following result:

    gremlin> g = TinkerFactory.createModern().traversal()
    ==>graphtraversalsource[tinkergraph[vertices:6 edges:6], standard]
    gremlin> g.V().match(
    ......1>       __.as('v1').out().dedup().fold().as('v1n'),
    ......2>       __.as('v1').V().as('v2'),
    ......3>       __.as('v2').out().dedup().fold().as('v2n')).
    ......4>     where('v1', lt('v2')).
    ......5>       by(id).
    ......6>     where('v1', without('v2n')).
    ......7>     where('v2', without('v1n')).as('m').
    ......8>   project('v1','v2','i','u').
    ......9>     by(select('v1')).
    .....10>     by(select('v2')).
    .....11>     by(select('v1n').as('n').
    .....12>        select('m').
    .....13>        select('v2n').unfold().
    .....14>          where(within('n')).
    .....15>        count()).
    .....16>     by(union(select('v1n'),
    .....17>              select('v2n')).unfold().
    .....18>        dedup().count()).
    .....19>   filter(select('u').is(gt(0))).
    .....20>   project('v1','v2','j').
    .....21>     by(select('v1')).
    .....22>     by(select('v2')).
    .....23>     by(math('i/u'))
    ==>[v1:v[1],v2:v[5],j:0.0]
    ==>[v1:v[1],v2:v[6],j:0.3333333333333333]
    ==>[v1:v[2],v2:v[4],j:0.0]
    ==>[v1:v[2],v2:v[6],j:0.0]
    ==>[v1:v[4],v2:v[6],j:0.5]
    ==>[v1:v[5],v2:v[6],j:0.0]
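These numbers can be cross-checked without a graph database. The sketch below is plain Python rather than Gremlin and is not part of the original answer: OUT is a hand-written copy of the modern toy graph's outgoing adjacency, and jaccard_pairs is a hypothetical helper that mirrors the traversal's filters (v1 id smaller than v2 id, neither vertex a neighbor of the other, non-empty union).

```python
from itertools import combinations

# Outgoing neighbors per vertex id, copied by hand from the modern toy graph.
OUT = {1: {2, 3, 4}, 2: set(), 3: set(), 4: {3, 5}, 5: set(), 6: {3}}


def jaccard_pairs(out):
    """Return (v1, v2, j) for every pair that survives the traversal's filters."""
    results = []
    # combinations() over sorted ids yields each pair once, with v1 < v2.
    for v1, v2 in combinations(sorted(out), 2):
        n1, n2 = out[v1], out[v2]
        # Skip pairs where one vertex is a neighbor of the other.
        if v1 in n2 or v2 in n1:
            continue
        union = n1 | n2
        # Skip pairs with no neighbors at all to avoid dividing by zero.
        if not union:
            continue
        results.append((v1, v2, len(n1 & n2) / len(union)))
    return results


for v1, v2, j in jaccard_pairs(OUT):
    print(f"v1:v[{v1}] v2:v[{v2}] j:{j}")
```

The six pairs and scores printed by this sketch match the console output above, which suggests the traversal's filters and the set arithmetic agree on this small graph.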