Question
I'm trying to programmatically locate all duplicate nodes in a Neo4j 1.8 database.
The nodes that need examination all have a (non-indexed) property externalId,
and I want to find the nodes that share the same value. This is the Cypher query I've got:
START n=node(*), dup=node(*) WHERE
HAS(n.externalId) AND HAS(dup.externalId) AND
n.externalId=dup.externalId AND
ID(n) < ID(dup)
RETURN dup
There are fewer than 10K nodes in the data and fewer than 1K nodes with an externalId.
The query above works but seems to perform badly. Is there a less memory-consuming way to do this?
Answer 1:
Try this query:
START n=node(*)
WHERE HAS(n.externalId)
WITH n.externalId AS extId, COLLECT(n) AS cn
WHERE LENGTH(cn) > 1
RETURN extId, cn;
It avoids taking the Cartesian product of your nodes. It finds the distinct externalId values, collects all the nodes with the same value, and then filters out the values that occur only once. Each row in the result will contain an externalId and a collection of the duplicate nodes with that id.
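To make the grouping idea concrete outside of Cypher, here is a minimal Python sketch of the same logic: one pass over the nodes, bucketing by externalId, then keeping only buckets with more than one member. The sample data and the function name find_duplicate_groups are made up for illustration and are not part of any Neo4j API.

```python
from collections import defaultdict

# Hypothetical sample data: (node id, externalId or None if the property is absent)
nodes = [(1, "a"), (2, "b"), (3, "a"), (4, None), (5, "b"), (6, "c")]

def find_duplicate_groups(nodes):
    """Group node ids by externalId in a single pass, then keep groups of size > 1."""
    groups = defaultdict(list)
    for node_id, ext_id in nodes:
        if ext_id is not None:  # mirrors WHERE HAS(n.externalId)
            groups[ext_id].append(node_id)  # mirrors COLLECT(n) per externalId
    # mirrors WHERE LENGTH(cn) > 1
    return {ext_id: ids for ext_id, ids in groups.items() if len(ids) > 1}

print(find_duplicate_groups(nodes))  # {'a': [1, 3], 'b': [2, 5]}
```

Each node is visited once, so the work grows with the number of nodes rather than with the number of node pairs.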
Answer 2:
The START clause performs a full graph scan and then assembles a Cartesian product of the entire set of nodes (10K * 10K = 100M pairs to start from), which the WHERE clause then narrows down based on its criteria. (Maybe there are Cypher optimizations here? I'm not sure.)
I think adding an index on externalId would be a clear win and may provide enough of a performance gain for now, but you could also look at finding duplicates in a different way, perhaps something like this:
START n=node(*)
WHERE HAS(n.externalId)
WITH n
ORDER BY ID(n) ASC
WITH count(*) AS occurrences, n.externalId AS externalId, collect(ID(n)) AS ids
WHERE occurrences > 1
RETURN externalId, TAIL(ids)
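As I read it, the intent of this query is to keep the node with the lowest id for each externalId and return only the remaining ids as the redundant ones. A rough Python sketch of that logic follows, under the assumption that the collected ids preserve the sorted order; the sample data and the function name redundant_ids are hypothetical.

```python
# Hypothetical sample data: (node id, externalId or None if the property is absent)
nodes = [(5, "b"), (3, "a"), (1, "a"), (4, None), (2, "b"), (7, "a")]

def redundant_ids(nodes):
    """For each duplicated externalId, return every node id except the lowest."""
    groups = {}
    # mirrors ORDER BY ID(n) ASC
    for node_id, ext_id in sorted(nodes, key=lambda pair: pair[0]):
        if ext_id is not None:  # mirrors WHERE HAS(n.externalId)
            groups.setdefault(ext_id, []).append(node_id)
    # mirrors WHERE occurrences > 1 and TAIL(ids): drop the first (lowest) id
    return {ext_id: ids[1:] for ext_id, ids in groups.items() if len(ids) > 1}

print(redundant_ids(nodes))  # {'a': [3, 7], 'b': [5]}
```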
Source: https://stackoverflow.com/questions/28302510/an-effective-way-to-lookup-duplicate-nodes-in-neo4j-1-8