Question
As part of some analysis, I am trying to find targets that have more than 80% common origins for one-hop paths.
The data is of the kind: all nodes are systems, and the only relationship that is relevant is ConnectsTo.
So, I can write queries like
match (n:system)-[r:ConnectsTo]->(m:system) return n,m
to get the sources n for system m.
I am looking to find all systems m that have 80% or more common source systems.
Please advise how this could be done for all systems. I tried with collect but am afraid I couldn't write the proper iteration.
Answer 1:
Let's start by creating a simple example data set:
CREATE
(s1:System {name:"s1"}),
(s2:System {name:"s2"}),
(s3:System {name:"s3"}),
(s4:System {name:"s4"}),
(s5:System {name:"s5"}),
(s1)-[:ConnectsTo]->(s3),
(s1)-[:ConnectsTo]->(s4),
(s2)-[:ConnectsTo]->(s3),
(s2)-[:ConnectsTo]->(s4),
(s2)-[:ConnectsTo]->(s5)
This results in the following graph.
We start from node pairs (m1 and m2) that have at least a single common source. We calculate:
- the number of sources for each node (sources1Count and sources2Count)
- the number of common sources (commonSources)
Then we compare the number of common sources to the number of sources for each node. This could use a bit of fine-tuning, based on what you consider "80% common". The query divides by the float literal 0.8, so the comparison is done in floating point; if you compare the ratio of the two integer counts instead, the toFloat function is required to avoid integer division.
The query:
MATCH (m1)<-[:ConnectsTo]-()-[:ConnectsTo]->(m2)
MATCH
(n1)-[:ConnectsTo]->(m1),
(n2)-[:ConnectsTo]->(m2)
WITH m1, m2, COUNT(DISTINCT n1) AS sources1Count, COUNT(DISTINCT n2) AS sources2Count
MATCH (m1)<-[:ConnectsTo]-(n)-[:ConnectsTo]->(m2)
WITH m1, m2, sources1Count, sources2Count, COUNT(n) AS commonSources
WHERE
// we only need each m1-m2 pair once
ID(m1) < ID(m2) AND
// similarity
commonSources / 0.8 >= sources1Count AND
commonSources / 0.8 >= sources2Count
RETURN m1, m2
ORDER BY m1.name, m2.name
This gives the following results.
╒══════════╤══════════╕
│m1 │m2 │
╞══════════╪══════════╡
│{name: s3}│{name: s4}│
└──────────┴──────────┘
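As a cross-check outside Neo4j, the same calculation can be sketched in plain Python. This is only an illustration under assumptions: the edge list is copied from the example data set above, and the names EDGES and similar_targets are made up for this sketch, not part of the answer's query.

```python
from collections import defaultdict
from itertools import combinations

# Edges of the example graph: (source, target) for each ConnectsTo relationship.
EDGES = [
    ("s1", "s3"), ("s1", "s4"),
    ("s2", "s3"), ("s2", "s4"), ("s2", "s5"),
]

def similar_targets(edges, threshold=0.8):
    """Return sorted target pairs whose common sources cover at least
    `threshold` of each target's own sources (hypothetical helper)."""
    # Collect the set of sources for every target node.
    sources = defaultdict(set)
    for src, dst in edges:
        sources[dst].add(src)

    pairs = []
    # combinations() yields each unordered pair once, like ID(m1) < ID(m2).
    for m1, m2 in combinations(sorted(sources), 2):
        common = len(sources[m1] & sources[m2])
        if common == 0:
            continue  # the Cypher query only considers pairs sharing a source
        # Same test as the WHERE clause: commonSources / 0.8 >= sourcesCount
        if (common / threshold >= len(sources[m1])
                and common / threshold >= len(sources[m2])):
            pairs.append((m1, m2))
    return pairs

print(similar_targets(EDGES))  # [('s3', 's4')]
```

s3 and s4 share both of their sources, so they pass; s5 has a single source that covers only half of the sources of s3 and s4, so those pairs are dropped.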
PS. For checking the similarity, you could use something like:
sources1Count <= toInt(commonSources / 0.8) >= sources2Count
This avoids the duplication of 0.8 but does not look very nice.
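To see that the chained form accepts the same pairs as the two separate comparisons, here is a small Python sketch. It assumes that toInt truncates a non-negative float the way Python's int() does; the function names are hypothetical and exist only for this comparison.

```python
def check_separate(common, count1, count2, threshold=0.8):
    # Form used in the query: two separate comparisons.
    return common / threshold >= count1 and common / threshold >= count2

def check_chained(common, count1, count2, threshold=0.8):
    # Form from the PS: one chained comparison on the truncated value.
    # int() truncates toward zero (assumed to match toInt for values >= 0).
    return count1 <= int(common / threshold) >= count2

# Brute-force comparison over small, non-negative counts.
mismatches = [
    (c, n1, n2)
    for c in range(0, 21)
    for n1 in range(0, 31)
    for n2 in range(0, 31)
    if check_separate(c, n1, n2) != check_chained(c, n1, n2)
]
print(mismatches)  # []
```

The two forms agree because the counts are integers: a non-negative value is at least an integer n exactly when its truncation is at least n.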
Update: an idea from InverseFalcon in the comments: use SIZE instead of MATCH and COUNT:
MATCH (m1)<-[:ConnectsTo]-()-[:ConnectsTo]->(m2)
WITH m1, m2, SIZE(()-[:ConnectsTo]->(m1)) as sources1Count, SIZE(()-[:ConnectsTo]->(m2)) as sources2Count
MATCH (m1)<-[:ConnectsTo]-(n)-[:ConnectsTo]->(m2)
WITH m1, m2, sources1Count, sources2Count, COUNT(n) AS commonSources
WHERE
// we only need each m1-m2 pair once
ID(m1) < ID(m2) AND
// similarity
commonSources / 0.8 >= sources1Count AND
commonSources / 0.8 >= sources2Count
RETURN m1, m2
ORDER BY m1.name, m2.name
Source: https://stackoverflow.com/questions/40454537/finding-matches-between-start-nodes-for-common-sources-in-neo4j