Neo4j Cypher: Finding the largest disconnected subgraph fast

Posted by 时光毁灭记忆、已成空白 on 2019-12-07 12:58:07

Question


I have a graph with one million nodes. It contains many disconnected subgraphs, and I would like to find the size of the largest one.

For instance, the example graph contains three disconnected subgraphs, so the output in this case would be 7.

I tried the following query, but it takes a long time:

match p = ()-[*]-() return MAX(length(p)) as l order by l desc limit 1

Answer 1:


Your query will only ever return the longest path between two separate nodes, not the size of the largest connected subgraph.
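The difference is easy to see on a toy graph outside the database. Below is a minimal in-memory sketch in plain Python (not Cypher, and not a Neo4j API). The star-shaped `star` graph and both helper functions are made up for illustration. Like a Cypher path, the longest path here may not reuse a relationship.

```python
from collections import deque

# Hypothetical toy graph: one hub joined to four leaves (undirected adjacency lists).
star = {
    "hub": ["a", "b", "c", "d"],
    "a": ["hub"],
    "b": ["hub"],
    "c": ["hub"],
    "d": ["hub"],
}

def component_size(adj, start):
    """Count every node reachable from start using breadth-first search."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adj[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return len(seen)

def longest_path_from(adj, node, used=frozenset()):
    """Longest path (in relationships) from node without reusing a relationship.

    Simplification: a relationship is identified by its two endpoints,
    so parallel relationships are not modelled.
    """
    best = 0
    for neighbour in adj[node]:
        edge = frozenset((node, neighbour))
        if edge not in used:
            best = max(best, 1 + longest_path_from(adj, neighbour, used | {edge}))
    return best

longest = max(longest_path_from(star, n) for n in star)
size = component_size(star, "hub")
print(longest, size)  # prints: 2 5
```

The longest path (leaf, hub, leaf) has length 2, while the connected subgraph holds 5 nodes, so maximising path length cannot answer the question.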

Unfortunately Neo4j does not currently have any native support for subgraph operations, and I don't think APOC Procedures has anything here either.

There are ways to find subgraphs in Cypher, but the queries I can think of are not fast and are likely to time out on large graphs. Here's one. Again, it is not recommended and will probably time out for you, but if it works, awesome:

MATCH (n)-[*0..]-(subgraphNode)
WITH n, COUNT(DISTINCT subgraphNode) as subSize
RETURN MAX(subSize)

If you need to run this query repeatedly rather than just once, then I'd recommend a means of tracking your subgraphs.

While I can give an approach to creating subgraph tracking, the approach for keeping this updated across graph operations (those that merge subgraphs, divide into smaller subgraphs, or create new subgraphs) is bound to be trickier, and you'll likely need some kind of Java extension to perform post-transaction processing to maintain this.

Also, this approach is best done during a maintenance window when no write operations are occurring.

The end-goal for this is to attach a single :Subgraph node to every disconnected subgraph, which will make future operations on subgraphs much easier, including your case of finding the largest disconnected subgraph.

The overall approach to fulfilling that goal is to first label all nodes in your graph (with a label like :Unprocessed), then, in batched queries for :Unprocessed nodes, find the entire disconnected subgraph they are a part of, attach a single :Subgraph node to it, and then remove the :Unprocessed label from the subgraph.
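The plan above can be sketched in memory to show how the pieces fit together. This is a rough Python simulation under stated assumptions, not Neo4j code: the `unprocessed` set stands in for the :Unprocessed label, the `markers` set stands in for the attach points of the :Subgraph nodes, and the 12-node example graph (chains of 7, 3 and 2 nodes) is hypothetical.

```python
from collections import deque

def reachable(adj, start):
    """Return the set of all nodes in the same connected subgraph as start."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adj[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen

def mark_subgraphs(adj, limit):
    """Pick one attach node (smallest id) per disconnected subgraph, in batches."""
    unprocessed = set(adj)   # stands in for the :Unprocessed label
    markers = set()          # stands in for the nodes that receive a :Subgraph node
    while unprocessed:
        batch = sorted(unprocessed)[:limit]
        attach_ids = set()   # a set de-duplicates, like a DISTINCT step
        for start in batch:
            component = reachable(adj, start)
            unprocessed -= component          # clear the label on the whole subgraph
            attach_ids.add(min(component))    # smallest id is the attach point
        markers |= attach_ids
    return markers

# Hypothetical graph: three disconnected chains of 7, 3 and 2 nodes.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (7, 8), (8, 9), (10, 11)]
adj = {}
for a, b in edges:
    adj.setdefault(a, []).append(b)
    adj.setdefault(b, []).append(a)

markers = mark_subgraphs(adj, limit=2)
largest = max(len(reachable(adj, m)) for m in markers)
print(sorted(markers), largest)  # prints: [0, 7, 10] 7
```

Once each subgraph has exactly one marker, finding the largest subgraph needs one traversal per marker rather than one per node.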

So, first, label all nodes in your db:

MATCH (n)
SET n:Unprocessed

Next, the batch operation. You'll want to use APOC Procedures to allow batch processing. Batching also benefits from entire subgraphs losing the :Unprocessed label as we process them, since we don't want to redundantly perform operations on the same subgraph.

CALL apoc.periodic.commit("
// only process a batch of :Unprocessed nodes at a time
MATCH (n:Unprocessed)
WITH n LIMIT $limit
// subgraphNode will be all nodes in the subgraph including n
MATCH (n)-[*0..]-(subgraphNode)
WITH DISTINCT n, subgraphNode
REMOVE subgraphNode:Unprocessed
// find attach point node in each subgraph with smallest id
WITH n, min(id(subgraphNode)) as attachId
WITH DISTINCT attachId
MATCH (attachNode)
WHERE id(attachNode) = attachId
CREATE (attachNode)<-[:SUBGRAPH]-(:Subgraph)
RETURN count(*)
",{limit:100})

You can adjust your limit as necessary. A lower limit might actually work better, as this may reduce redundant operations on nodes of the same subgraph.
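The effect of the limit on redundant work can be estimated with a small simulation. The Python sketch below is an illustration only and rests on an assumption: that every node picked for a batch is expanded before any labels are removed, so several nodes of one subgraph in the same batch each trigger a full traversal. The function names and the 12-node graph are hypothetical.

```python
from collections import deque

def reachable(adj, start):
    """Return the set of all nodes in the same connected subgraph as start."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adj[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen

def count_expansions(adj, limit):
    """Count traversals started when every node of each batch is expanded."""
    unprocessed = set(adj)
    expansions = 0
    while unprocessed:
        batch = sorted(unprocessed)[:limit]
        for start in batch:
            expansions += 1                      # one full traversal per batch node
            unprocessed -= reachable(adj, start)
    return expansions

# Hypothetical graph: three disconnected chains of 7, 3 and 2 nodes.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (7, 8), (8, 9), (10, 11)]
adj = {}
for a, b in edges:
    adj.setdefault(a, []).append(b)
    adj.setdefault(b, []).append(a)

for limit in (100, 2, 1):
    print(limit, count_expansions(adj, limit))
```

On this toy graph with three subgraphs, limits of 100, 2 and 1 lead to 12, 6 and 3 traversals respectively. The trade-off is that a smaller limit means more, smaller batches to commit.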

Now that all disconnected subgraphs have a :Subgraph node attached, you can make faster and easier queries for each subgraph. So, to find the largest disconnected subgraph, you might use:

MATCH (sub:Subgraph)-[*]-(subgraphNode)
WITH sub, COUNT(DISTINCT subgraphNode) as subSize
RETURN MAX(subSize)

EDIT

I found a faster means of gathering subgraph nodes compared to using a variable relationship match. APOC's Path Expander functionality, using NODE_GLOBAL uniqueness, should perform faster. Here are the relevant queries modified to use this approach.

CALL apoc.periodic.commit("
// only process a batch of :Unprocessed nodes at a time
MATCH (n:Unprocessed)
WITH n LIMIT $limit
// subgraphNode will be all nodes in the subgraph including n
CALL apoc.path.expandConfig(n,{bfs:true, uniqueness:"NODE_GLOBAL"}) 
  YIELD path
WITH n, LAST(NODES(path)) as subgraphNode
REMOVE subgraphNode:Unprocessed
// find attach point node in each subgraph with smallest id
WITH n, min(id(subgraphNode)) as attachId
WITH DISTINCT attachId
MATCH (attachNode)
WHERE id(attachNode) = attachId
CREATE (attachNode)<-[:SUBGRAPH]-(:Subgraph)
RETURN count(*)
",{limit:100})

And the processing for each subgraph:

MATCH (sub:Subgraph)
CALL apoc.path.expandConfig(sub,{minLevel:1, bfs:true, uniqueness:"NODE_GLOBAL"}) 
  YIELD path
WITH sub, LAST(NODES(path)) as subgraphNode
WITH sub, COUNT(DISTINCT subgraphNode) as subSize
RETURN MAX(subSize)


Source: https://stackoverflow.com/questions/41617574/neo4j-cypher-finding-the-largest-disconnected-subgraph-fast
