Optimizing Cypher queries with lots of optional relationships

问题

I'm using Cypher over the batch REST API with Neo4j 2.0.1.

I'm trying to optimize my queries which have lots of optional relationships. I'd like to retrieve all of the data in one shot to limit the number of round trips I have to make to the database. Even though I only have about 12000 nodes in my database, the queries are already starting to crawl (some are taking over 1.5 seconds to return 1000 nodes).

I've set up a graph gist that goes into more detail at http://gist.neo4j.org/?9494429e3cbbbeda2b11.

My queries all generally take the following form:

MATCH (u:user { id: "u1" })
WITH u

MATCH u-[:CONTACT]->(c:contact)
WITH u, c

OPTIONAL MATCH  (c)-[:CREATED]->(xca:activity)<-[:USERACTIVITY]-(xcc:contact)
OPTIONAL MATCH  (c)-[:HISTORY]->(xcu:activity)<-[:USERACTIVITY]-(xuc:contact)
OPTIONAL MATCH (c)-[:PHONE]->(xp:phone)
OPTIONAL MATCH (c)-[:ADDRESS]->(xa:address)
OPTIONAL MATCH (u)-[:PHONE]->(xup:phone)
OPTIONAL MATCH (u)-[:ADDRESS]->(xua:address)
WITH DISTINCT c AS x, u,
    COLLECT(DISTINCT xp) AS xps,
    COLLECT(DISTINCT xa) AS xas,
    COLLECT(DISTINCT xup) AS xups,
    COLLECT(DISTINCT xua) AS xuas,
    xca.createdat AS createdat,
    xcu.createdat AS updatedat,
    {id: xcc.id} AS createdby,
    {id: xuc.id} AS updatedby
RETURN COLLECT({
    id: x.id,
    name:  COALESCE(u.name, x.name), 
    createdat: createdat,
    createdby: createdby,
    updatedat: updatedat,
    updatedby: updatedby,
        phones: (CASE WHEN size(xps)= 0
          THEN NULL
          ELSE [xp IN xps | { id: xp.id, number: xp.number}]
          END),
        userphones: (CASE WHEN size(xups)= 0
          THEN NULL
          ELSE [xup IN xups | { id: xup.id, number: xup.number }]
          END),
        addresses: (CASE WHEN size(xas)= 0
          THEN NULL
          ELSE [xa IN xas | { id: xa.id, city: xa.city}]
          END),
        useraddresses: (CASE WHEN size(xuas)= 0
          THEN NULL
          ELSE [xua IN xuas | { id: xua.id, city: xua.city}]
          END)
}) AS r

Is there a better way to query for nodes that have lots of optional relationships? What type of performance should I be expecting in cases like these? Thanks!

UPDATE

I've rewritten the query based on Michael's suggestion and verified that each step has a cardinality of 1 (that is, only one row is returned for each step of the query). When I return just a single contact the query takes about 400ms (about 5x worse than the original query).

However, when I try and run the query against my larger dataset that returns 1000 contacts it now just hangs, never completes, and I have to restart the Neo4j server. I didn't try and combine the results into a new map, but I don't think that is going to solve the problem. Am I creating a worse cross product now that doesn't show up when I test the query step by step?

  MATCH (u:user { id: "123" })
  WITH u

  MATCH (u)-[:CONTACT]->(c:contact)
  WITH c

  OPTIONAL MATCH
    (c)-[:CREATED]->(xca:activity)-[:USERACTIVITY*1..4]<-(xcc:contact),
    (c)-[:HISTORY]->(xcu:activity)-[:USERACTIVITY*1..4]<-(xuc:contact)
  WITH c AS x,
    xca.createdat AS createdat, xcu.createdat AS updatedat,
    {id: xcc.id, object: xcc.object} AS createdby,
    {id: xuc.id, object: xuc.object} AS updatedby

  OPTIONAL MATCH
    (x)-[:PHONE]->(xp:phone)
  WITH x, createdat, updatedat, createdby, updatedby,
    COLLECT(xp) as xps

  OPTIONAL MATCH  
    (x)-[:ADDRESS]->(xa:address)
  WITH x, createdat, updatedat, createdby, updatedby, xps,
    COLLECT(xa) as xas

  OPTIONAL MATCH (xu:user)-[:CONTACT]->(x)
  OPTIONAL MATCH (xu)-[:PHONE]->(xup:phone)
  WITH x, createdat, updatedat, createdby, updatedby, xps, xas,
    xu, COLLECT(xup) as xups

  OPTIONAL MATCH (xu)-[:ADDRESS]->(xua:address)
  WITH x, createdat, updatedat, createdby, updatedby, xps, xas,
    xu, xups, COLLECT(xua) as xuas

  RETURN COLLECT({
    id: x.id,
    object: x.object,
    status: x.status,
    teamid: x.teamid,
    name:  COALESCE(xu.name, x.name),
    displayname: COALESCE(xu.displayname, x.displayname),
    email: COALESCE(xu.email, x.email),
    imageurl: COALESCE(xu.imageurl, x.imageurl),
    workhours: x.workhours,
    notes: x.notes,
    company: x.company,
    createdat: createdat,
    createdby: createdby,
    updatedat: updatedat,
    updatedby: updatedby,
    isuser: (NOT xu IS NULL),
    phones: (CASE WHEN size(xps)= 0
      THEN NULL
      ELSE [xp IN xps | { id: xp.id, object: xp.object,
            number: xp.number, description: xp.description }]
      END),
    userphones: (CASE WHEN size(xups)= 0
      THEN NULL
      ELSE [xup IN xups | { id: xup.id, object: xup.object,
            number: xup.number, description: xup.description }]
      END),
    addresses: (CASE WHEN size(xas)= 0
      THEN NULL
      ELSE [xa IN xas | { id: xa.id, object: xa.object,
            street: xa.street, locality: xa.locality, region: xa.region,
            postcode: xa.postcode, country: xa.country, description: xa.description, neighborhood: xa.neighborhood }]
      END),
    useraddresses: (CASE WHEN size(xuas)= 0
      THEN NULL
      ELSE [xua IN xuas | { id: xua.id, object: xua.object,
            street: xua.street, locality: xua.locality, region: xua.region,
            postcode: xua.postcode, country: xua.country, description: xua.description, neighborhood: xua.neighborhood }]
      END)
  }) AS r

UPDATE 2

I tried moving the phones and addresses around but it had no effect, I even took them out and still saw similar results (over 2s for 1000 contacts). I've eliminated all of the complexity in the query just to see what a baseline would be. The following query takes 385ms on average to execute:

MATCH (t:team {id:"123"})
WITH t

MATCH (c:contact)-[:CONTACT]->(t)
WITH c AS x

RETURN COLLECT({
    id: x.id,
    object: x.object,
    status: x.status,
    teamid: x.teamid,
    name:  x.name,
    displayname: x.displayname,
    email: x.email,
    imageurl: x.imageurl,
    workhours: x.workhours,
    notes: x.notes,
    company: x.company
}) AS r

My database has 6000 nodes and 12000 relationships and this query returns 1000 contacts (the entire db is 7 MB in size). Is almost 400ms for this type of query expected?

I really appreciate the offer to look at my DB, but I guess I really want to know how to diagnose these problems myself. When I use the web UI, I see no explosion (only 1 row per result being returned). When I use the PROFILE command, I see no numbers in the millions like you expect.

Are there other tools available to diagnose performance issues? Is there a debugger of some kind to track down the problem?

回答1:

The issue is that you create cross products between all of your matches.

If you can identify the matches that have at most one connection you can pull them upfront. Otherwise you can collect the matched information to get back to your cardinality of 1 (or # of contacts ftm).

e.g.

MATCH (u:user { id: "u1" })
OPTIONAL MATCH (u)-[:PHONE]->(xup:phone)
OPTIONAL MATCH (u)-[:ADDRESS]->(xua:address)
// cardinality 1
WITH u, collect(distinct xup) as phones, collect(distinct xua) as addresses
MATCH (u)-[:CONTACT]->(c:contact)
WITH u, c, phones, addresses
OPTIONAL MATCH (c)-[:CREATED]->(xca:activity)<-[:USERACTIVITY]-(xcc:contact)
WITH u,c, phones,addresses, collect(distinct xcc) as contact_activities
...

You already used map literals so you can also combine them with what I suggested by incrementally adding keys to the map (or a collection)

e.g.

MATCH (u:user { id: "u1" })
OPTIONAL MATCH (u)-[:PHONE]->(xup:phone)
OPTIONAL MATCH (u)-[:ADDRESS]->(xua:address)
// cardinality 1
WITH u, {user:u, phones:collect(distinct xup), addresses: collect(distinct xua)} as user_info
MATCH (u)-[:CONTACT]->(c:contact)
WITH c, user_info
OPTIONAL MATCH (c)-[:CREATED]->(xca:activity)<-[:USERACTIVITY]-(xcc:contact)
WITH c, user_info, {activities: collect(distinct xcc)} as contact_info
...

Also DISTINCT and aggregation will very probably not help you. Aggregation is already creating distinct entries for the grouping key.

I tried to adapt your graphgist (thanks for providing that btw) to show how it could look like (but I didn't go fully through): http://gist.neo4j.org/?bba019835045ed352925

You might be also interested in this graphgist: A complex query result projection in cypher

来源：https://stackoverflow.com/questions/23721048/optimizing-cypher-queries-with-lots-of-optional-relationships

标签

rest

neo4j

cypher