Finding matches between start nodes for common sources in neo4j

Posted by 有些话、适合烂在心里 on 2020-01-05 03:34:07

Question


As part of some analysis, I am trying to find targets that have more than 80% common origins for one-hop paths.

The data is of the kind: all nodes are systems, and the only relationship that is relevant is ConnectsTo.

So, I can write queries like

match (n:system)-[r:ConnectsTo]->(m:system) return n,m

to get the sources n for system m.

I am looking to find all systems m that have 80% or more common source systems.

Please advise how this could be done for all systems. I tried with collect but am afraid I couldn't write the proper iteration.


Answer 1:


Let's start by creating a simple example data set:

CREATE
  (s1:System {name:"s1"}), 
  (s2:System {name:"s2"}), 
  (s3:System {name:"s3"}), 
  (s4:System {name:"s4"}), 
  (s5:System {name:"s5"}), 
  (s1)-[:ConnectsTo]->(s3),
  (s1)-[:ConnectsTo]->(s4),
  (s2)-[:ConnectsTo]->(s3),
  (s2)-[:ConnectsTo]->(s4),
  (s2)-[:ConnectsTo]->(s5)

This results in a graph in which s1 connects to s3 and s4, and s2 connects to s3, s4 and s5.

We start from node pairs (m1 and m2) that have at least one common source. For each pair, we calculate:

  • the number of sources for each node (sources1Count and sources2Count)
  • the number of common sources (commonSources)

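Then we compare the number of common sources to the number of sources of each node. In the query below, a pair qualifies when the common sources make up at least 80% of the sources of both nodes; adjust the condition if you define "80% common" differently. Dividing by the float literal 0.8 makes the division floating-point, so no explicit toFloat conversion is needed.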

The query:

MATCH (m1)<-[:ConnectsTo]-()-[:ConnectsTo]->(m2)
MATCH
  (n1)-[:ConnectsTo]->(m1),
  (n2)-[:ConnectsTo]->(m2)
WITH m1, m2, COUNT(DISTINCT n1) AS sources1Count, COUNT(DISTINCT n2) AS sources2Count
MATCH (m1)<-[:ConnectsTo]-(n)-[:ConnectsTo]->(m2)
WITH m1, m2, sources1Count, sources2Count, COUNT(n) AS commonSources
WHERE
  // we only need each m1-m2 pair once
  ID(m1) < ID(m2) AND
  // similarity
  commonSources / 0.8 >= sources1Count AND
  commonSources / 0.8 >= sources2Count
RETURN m1, m2
ORDER BY m1.name, m2.name

This gives the following results.

╒══════════╤══════════╕
│m1        │m2        │
╞══════════╪══════════╡
│{name: s3}│{name: s4}│
└──────────┴──────────┘
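
To sanity-check this result outside the database, the same counting can be sketched in plain Python. This is only an illustration under assumptions: the edge list is copied by hand from the CREATE statement above, and the function name similar_targets and its threshold parameter are made up for this example, not part of Neo4j or the original answer.

```python
from itertools import combinations

# Sample edges copied from the CREATE statement: (source, target).
EDGES = [
    ("s1", "s3"), ("s1", "s4"),
    ("s2", "s3"), ("s2", "s4"), ("s2", "s5"),
]


def similar_targets(edges, threshold=0.8):
    """Hypothetical helper: return target pairs whose common sources
    cover at least `threshold` of the sources of both targets."""
    # Map each target to the set of its distinct sources.
    sources = {}
    for src, dst in edges:
        sources.setdefault(dst, set()).add(src)

    pairs = []
    # combinations() yields each unordered pair once, like ID(m1) < ID(m2).
    for m1, m2 in combinations(sorted(sources), 2):
        common = len(sources[m1] & sources[m2])
        if common == 0:
            continue  # the Cypher query only sees pairs with a common source
        # Same test as the WHERE clause: commonSources / 0.8 >= sourcesCount.
        if (common / threshold >= len(sources[m1])
                and common / threshold >= len(sources[m2])):
            pairs.append((m1, m2))
    return pairs


print(similar_targets(EDGES))
```

For the sample data, s3 and s4 share both of their sources (2 / 0.8 = 2.5 >= 2), while s5 shares only one source with either of them (1 / 0.8 = 1.25 < 2), so only the s3-s4 pair passes.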

PS. To check the similarity, you could use a chained comparison like the following (toInteger replaces the older toInt, which was removed in Neo4j 4.0):

sources1Count <= toInteger(commonSources / 0.8) >= sources2Count

This avoids the duplication of 0.8 but does not look very nice.

Update: InverseFalcon suggested in the comments using SIZE on a pattern, instead of a separate MATCH and COUNT, to get the source counts:

MATCH (m1)<-[:ConnectsTo]-()-[:ConnectsTo]->(m2)
WITH m1, m2, SIZE(()-[:ConnectsTo]->(m1)) as sources1Count, SIZE(()-[:ConnectsTo]->(m2)) as sources2Count
MATCH (m1)<-[:ConnectsTo]-(n)-[:ConnectsTo]->(m2)
WITH m1, m2, sources1Count, sources2Count, COUNT(n) AS commonSources
WHERE
    // we only need each m1-m2 pair once
    ID(m1) < ID(m2) AND
    // similarity
    commonSources / 0.8 >= sources1Count AND
    commonSources / 0.8 >= sources2Count
RETURN m1, m2
ORDER BY m1.name, m2.name


Source: https://stackoverflow.com/questions/40454537/finding-matches-between-start-nodes-for-common-sources-in-neo4j
