Neo4j: Best way to batch relate nodes using Cypher?

Posted by ﹥>﹥吖頭↗ on 2020-01-11 12:54:05

Question


When I run a script that tries to batch merge all nodes of certain types, I get some weird performance results.

When merging two collections of nodes (~42k and ~26k), performance is nice and fast. But when I merge ~42k nodes with just 5, performance degrades DRAMATICALLY. I'm batching the parent nodes (the ~42k are split into batches of 500). Why does performance drop when I am, essentially, merging fewer nodes (the batch size is the same, but the source set is large and the target set is small)?

Relation Query:

MATCH (s:ContactPlayer)   
WHERE  has(s.ContactPrefixTypeId)    
WITH  collect(s) AS allP   
WITH  allP[7000..7500] as rangedP   
FOREACH  (parent in rangedP  |  
    MERGE (child:ContactPrefixType 
            {ContactPrefixTypeId:parent.ContactPrefixTypeId}
          )  
    MERGE (child)-[r:CONTACTPLAYER]->(parent)  
    SET r.ContactPlayerId = parent.ContactPlayerId ,      
        r.ContactPrefixTypeId = child.ContactPrefixTypeId  )
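
For reference, the slice bounds used above (`allP[7000..7500]`) can be generated client-side. The following is a minimal Python sketch, not the asker's actual script; the helper name `batch_ranges` is hypothetical, and the item count is taken from the log output further down. Cypher list slices exclude the end index, so each pair maps directly onto `allP[start..end]`.

```python
def batch_ranges(total, batch_size=500):
    """Yield (start, end) slice bounds that cover `total` items.

    `end` is exclusive, matching Cypher's list[start..end] slicing.
    """
    for start in range(0, total, batch_size):
        yield start, min(start + batch_size, total)


def slice_expression(start, end, name="allP"):
    """Render the Cypher slice for one batch (hypothetical helper)."""
    return f"{name}[{start}..{end}]"


# 42149 is the Contact count reported in the log output.
ranges = list(batch_ranges(42149))
```

With these numbers the import needs 85 batches, and the fifteenth one is the `allP[7000..7500]` slice shown in the query.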

Performance Results:

Process Starting

Starting to insert Contact items [+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ ++++++]


  • Total time for 42149 Contact items: 19176.87ms
  • Average time per batch (500): 213.4ms
  • Longest batch time: 663ms

Starting to insert ContactPlayer items [++++++++++++++++++++++++++++++++++++++++++++++++++++++++]


  • Total time for 27970 ContactPlayer items: 9419.2106ms
  • Average time per batch (500): 167.75ms
  • Longest batch time: 689ms

Starting to relate Contact to ContactPlayer [++++++++++++++++++++++++++++++++++++++++++++++++++++++++]


  • Total time taken to relate Contact to ContactPlayer: 7907.4877ms
  • Average time per batch (500): 141.151517857143ms
  • Longest batch time: 883.0918ms for Batch number: 0

Starting to insert ContactPrefixType items
[+]


  • Total time for 5 ContactPrefixType items: 22.0737ms
  • Average time per batch (500): 22ms
  • Longest batch time: 22ms

Already inserted data for Contact.

Starting to relate ContactPrefixType to Contact [+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ ++++++]


  • Total time taken to relate ContactPrefixType to Contact: 376540.8309ms
  • Average time per batch (500): 4429.78643647059ms
  • Longest batch time: 14263.1843ms for Batch number: 63

Answer 1:


So far, the best I could come up with is the following (and it's a hack, specific to my environment):

If / Else condition:

If childrenNodes.count() < 200 -> assume they are type identifiers for the parent... i.e. ContactPrefixType

Else assume it is a matrix for relating multiple item types together (i.e. ContactAddress)

If childNodes < 200

MATCH (parent:{parentLabel}), 
(child:{childLabel} {{childLabelIdProperty}:parent.{parentRelationProperty}})
CREATE (child)-[r:{relationshipLabel}]->(parent)

This takes about 3-5 seconds to complete per relationship type

Else

MATCH (child:{childLabel}), 
(parent:{parentLabel} {{parentPropertyField}:child.{childLabelIdProperty}})
WITH collect(parent) as parentCollection, child
WITH parentCollection[{batchStart}..{batchEnd}] as coll, child
FOREACH (parent in coll | 
CREATE (child)-[r:{relationshipLabel}]->(parent) )

I'm not sure this is the most efficient way of doing this, but after trying MANY different options, this seems to be the fastest.
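
The if/else selection described above could be sketched in Python roughly as follows. This is an illustration under assumptions, not the code used for the import: the function name, argument names, and the relationship type in the usage note are hypothetical. Labels and relationship types cannot be passed as Cypher parameters, which is why the query text is assembled by string templating; the values should therefore come from trusted configuration only.

```python
SMALL_CHILD_SET = 200  # threshold from the answer; specific to that environment

# Few child nodes: they act as type identifiers, so relate in one statement.
DIRECT_TEMPLATE = (
    "MATCH (parent:{parent_label}), "
    "(child:{child_label} {{{child_id}: parent.{parent_prop}}}) "
    "CREATE (child)-[r:{rel_type}]->(parent)"
)

# Many child nodes: relate a slice of the matched parents per call.
BATCHED_TEMPLATE = (
    "MATCH (child:{child_label}), "
    "(parent:{parent_label} {{{parent_prop}: child.{child_id}}}) "
    "WITH collect(parent) AS parentCollection, child "
    "WITH parentCollection[{start}..{end}] AS coll, child "
    "FOREACH (parent IN coll | CREATE (child)-[r:{rel_type}]->(parent))"
)


def build_relate_query(child_count, parent_label, child_label,
                       parent_prop, child_id, rel_type, start=0, end=500):
    """Pick a template based on how many child nodes exist and fill it in."""
    fields = dict(parent_label=parent_label, child_label=child_label,
                  parent_prop=parent_prop, child_id=child_id,
                  rel_type=rel_type)
    if child_count < SMALL_CHILD_SET:
        return DIRECT_TEMPLATE.format(**fields)
    return BATCHED_TEMPLATE.format(start=start, end=end, **fields)
```

For example, `build_relate_query(5, "Contact", "ContactPrefixType", "ContactPrefixTypeId", "ContactPrefixTypeId", "CONTACT_PREFIX_TYPE")` takes the first branch and returns the single-statement query.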

Stats:

  1. insert 225,018 nodes with 2,070,977 properties
  2. create 464,606 relationships

Total: 331 seconds.

Because this is a straight import and I'm not dealing with updates yet, I assume that all the relationships are correct, so I don't need to worry about invalid data. However, I will try to set properties on the relationships so that I can perform cleanup later (i.e. store the parent and child IDs as relationship properties for later reference).

If anyone can improve on this, I would love it.




Answer 2:


Can you pass the ids in as parameters rather than fetch them from the graph? The query could look like

MATCH (s:ContactPlayer {ContactPrefixTypeId:{cptid}})
MERGE (c:ContactPrefixType {ContactPrefixTypeId:{cptid}})
MERGE (c)-[:CONTACT_PLAYER]->(s)

If you use the REST API Cypher resource, I think the entity should look something like

{
    "query":...,
    "params": {
        "cptid":id1
    }
}

If you use the transactional endpoint, it should look something like this. You control transaction size by the number of statements in each call, and also by the number of calls before you commit. The Neo4j manual's section on the transactional HTTP endpoint has more detail.

{
    "statements":[
        {
            "statement":...,
            "parameters": {
                "cptid":id1
            }
        },
        {
            "statement":...,
            "parameters": {
                "cptid":id2
            }
        }
    ]
}
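
As a rough illustration of the payload shape above, here is a minimal Python sketch that groups a list of ids into request bodies for the transactional endpoint, one statement per id. It only builds the JSON and does not send anything. The function name `build_payloads` and the `statements_per_call` argument are hypothetical, and the statement uses the legacy `{cptid}` parameter syntax from this answer (newer Neo4j versions write parameters as `$cptid`).

```python
import json

# Statement from the answer, with a parameter in place of a graph lookup.
STATEMENT = (
    "MATCH (s:ContactPlayer {ContactPrefixTypeId: {cptid}}) "
    "MERGE (c:ContactPrefixType {ContactPrefixTypeId: {cptid}}) "
    "MERGE (c)-[:CONTACT_PLAYER]->(s)"
)


def build_payloads(ids, statements_per_call=2):
    """Split `ids` into request bodies with a fixed number of statements each."""
    payloads = []
    for start in range(0, len(ids), statements_per_call):
        chunk = ids[start:start + statements_per_call]
        payloads.append({
            "statements": [
                {"statement": STATEMENT, "parameters": {"cptid": cptid}}
                for cptid in chunk
            ]
        })
    return payloads


# Five prefix-type ids, two statements per call: three request bodies.
payloads = build_payloads([1, 2, 3, 4, 5], statements_per_call=2)
first_body = json.dumps(payloads[0], indent=4)
```

Each element of `payloads` would be sent as the body of one call; how many calls to make before committing is a separate choice, as the answer notes.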


Source: https://stackoverflow.com/questions/22102181/neo4j-best-way-to-batch-relate-nodes-using-cypher
