An effective way to look up duplicate nodes in Neo4j 1.8?

Question

I'm trying to programmatically locate all duplicate nodes in a Neo4j 1.8 database. The nodes that need examination all have a (non-indexed) property externalId, and I want to find the nodes that share the same value for it. This is the Cypher query I've got:

    START n=node(*), dup=node(*)
    WHERE HAS(n.externalId) AND HAS(dup.externalId)
      AND n.externalId = dup.externalId
      AND ID(n) < ID(dup)
    RETURN dup

There are fewer than 10K nodes in the data and fewer than 1K nodes with an externalId. The query above works but seems to perform badly. Is there a less memory-consuming way to do this?

Answer 1:

Try this query:

START n=node(*)
WHERE HAS(n.externalId)
WITH n.externalId AS extId, COLLECT(n) AS cn
WHERE LENGTH(cn) > 1
RETURN extId, cn;

It avoids taking the Cartesian product of your nodes. It finds the distinct externalId values, collects all the nodes with the same id, and then filters out the non-duplicated ids. Each row in the result will contain an externalId and a collection of the duplicate nodes with that id.
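To make the grouping idea concrete outside of Cypher, here is a minimal Python sketch of the same logic. It is only an illustration: the `nodes` list is a hypothetical in-memory stand-in for the graph, and `find_duplicates` is a name made up for this example, not part of any Neo4j API.

```python
from collections import defaultdict

# Hypothetical stand-in for the graph: (node id, properties) pairs.
nodes = [
    (1, {"externalId": "a"}),
    (2, {"name": "no external id"}),
    (3, {"externalId": "b"}),
    (4, {"externalId": "a"}),
    (5, {"externalId": "a"}),
]

def find_duplicates(nodes):
    """Group node ids by externalId in one pass; keep groups larger than one."""
    groups = defaultdict(list)
    for node_id, props in nodes:
        if "externalId" in props:  # mirrors WHERE HAS(n.externalId)
            # mirrors COLLECT(n) grouped by n.externalId
            groups[props["externalId"]].append(node_id)
    # mirrors WHERE LENGTH(cn) > 1
    return {ext_id: ids for ext_id, ids in groups.items() if len(ids) > 1}

duplicates = find_duplicates(nodes)
print(duplicates)  # {'a': [1, 4, 5]}
```

Each node is visited once, and memory grows with the number of nodes that carry an externalId rather than with the number of node pairs.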

Answer 2:

The START clause performs a full graph scan and then assembles a Cartesian product of the entire set of nodes (10K * 10K = 100 million pairs to start from); the WHERE clause then narrows that very large list down based on its criteria. (Maybe there are Cypher optimizations here? I'm not sure.)
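The arithmetic behind that estimate can be checked with a few lines of Python. This is a back-of-the-envelope sketch that assumes the upper bound of 10K nodes from the question and assumes no optimization prunes pairs before the WHERE clause runs.

```python
node_count = 10_000  # upper bound taken from the question

# START n=node(*), dup=node(*) pairs every node with every node.
pairs_examined = node_count * node_count

# START n=node(*) alone visits each node once.
single_scan = node_count

print(pairs_examined)                 # 100000000
print(pairs_examined // single_scan)  # 10000
```

Under those assumptions the pairwise query has 10,000 times more rows to filter than a single scan would.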

I think adding an index on externalId would be a clear win and may provide enough of a performance gain for now, but you could also look at finding duplicates in a different way, perhaps something like this:

START n=node(*)
WHERE HAS(n.externalId)
WITH n
ORDER BY ID(n) ASC 
WITH count(*) AS occurrences, n.externalId AS externalId, collect(ID(n)) AS ids
WHERE occurrences > 1
RETURN externalId, TAIL(ids)
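For readers who want to see what this query is meant to return, the following Python sketch mirrors its steps on plain data. It is an assumption-laden illustration: `surplus_ids` and the `sample` list are hypothetical names invented here, and the sketch says nothing about how Neo4j executes the query.

```python
from collections import defaultdict

def surplus_ids(nodes):
    """nodes: iterable of (node_id, external_id or None) pairs."""
    groups = defaultdict(list)
    # mirrors ORDER BY ID(n) ASC, so the lowest id comes first in each group
    for node_id, external_id in sorted(nodes, key=lambda pair: pair[0]):
        if external_id is not None:  # mirrors WHERE HAS(n.externalId)
            groups[external_id].append(node_id)
    # mirrors WHERE occurrences > 1 and TAIL(ids): drop the first id, keep the rest
    return {ext: ids[1:] for ext, ids in groups.items() if len(ids) > 1}

sample = [(7, "a"), (2, "a"), (5, None), (3, "b"), (9, "a")]
print(surplus_ids(sample))  # {'a': [7, 9]}
```

Because the ids are sorted first, the node with the lowest id in each group is the one left out of the result, which makes the remaining ids natural candidates for review or removal.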

Source: https://stackoverflow.com/questions/28302510/an-effective-way-to-lookup-duplicate-nodes-in-neo4j-1-8

Tags

performance

neo4j

cypher