Import billions of nodes and relationships to Neo4j using Batch Import on Windows

梦如初夏 2020-12-10 21:31

I want to insert a few billion nodes and relationships into Neo4j. Running "LOAD CSV" gets cancelled after 30 minutes by the browser (Chrome) as the working memory is overloa

2 Answers
  • 2020-12-10 22:00
    1. If you use LOAD CSV with PERIODIC COMMIT, you should not run into any memory trouble. A few billion nodes (so delightfully vague :-) may take a while to load, though.
    2. https://neo4j.com/docs/operations-manual/current/tutorial/import-tool/ explains how to load the database in offline mode. You can use either neo4j-import or neo4j-admin import to do that (both are command-line tools; the second is replacing the first). No plugins are needed at all. Make sure you are using a Windows zip installation (CE or EE); the exe installation (CE only) may not contain these tools.
    3. If you're doing any form of massive update, the browser is never a good choice. Seriously. It's meant for visualization, and if your syntax gives it any chance it will try to visualize the result. Is that really what you want for a long-running batch update? Use cypher-shell (command line) instead. Many issues here on Stack Overflow are not actually Neo4j issues but just people overloading the DOM structure of the browser (so it's actually a Firefox or Chrome issue).
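
    To make point 2 concrete, here is a minimal Python sketch that assembles an offline import command line. It assumes the 3.x-style option syntax (`--nodes:Label=file`, `--relationships:TYPE=file`), and the install path, label, relationship type and CSV file names are all hypothetical. The option syntax differs in later releases, so check the tool's built-in help for your version before running anything.

```python
from pathlib import PureWindowsPath


def build_import_command(neo4j_home, database, nodes, relationships):
    """Return the argument list for an offline `neo4j-admin import` run.

    `nodes` maps a label to a CSV path and `relationships` maps a
    relationship type to a CSV path. All names are illustrative only.
    """
    # On a Windows zip installation the launcher is bin\neo4j-admin.bat.
    tool = str(PureWindowsPath(neo4j_home) / "bin" / "neo4j-admin.bat")
    args = [tool, "import", f"--database={database}"]
    for label, csv_path in nodes.items():
        args.append(f"--nodes:{label}={csv_path}")
    for rel_type, csv_path in relationships.items():
        args.append(f"--relationships:{rel_type}={csv_path}")
    return args


command = build_import_command(
    neo4j_home=r"C:\neo4j",                       # assumed install location
    database="graph.db",                          # default database name in 3.x
    nodes={"Person": r"import\persons.csv"},      # hypothetical file
    relationships={"KNOWS": r"import\knows.csv"},  # hypothetical file
)
print(" ".join(command))

# To actually run it (with Neo4j stopped and the target database empty):
#   import subprocess
#   subprocess.run(command, check=True)
```

    Building the arguments as a list rather than one string avoids quoting problems with Windows paths when the command is handed to `subprocess.run`.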

    Hope this helps.

    Regards, Tom

  • 2020-12-10 22:17

    Before you use the tool on gigantic datasets, I can suggest a few things I just learned while importing millions of nodes in a few minutes (Neo4j Community Edition for Windows).

    Regarding Neo4j import tips:

    • Don't use the web interface to import such big datasets; memory overload is inevitable.

    • Instead, use a programming language to interact with Neo4j (I recently used the official Python module and it's simple to learn, but you can do the same with good old Java).

    • Before using LOAD CSV, remember to add the USING PERIODIC COMMIT clause so that large data sets are committed in batches instead of in one huge transaction.

    • Before importing relationships from CSV, remember to use CREATE CONSTRAINT ON <...> ASSERT <...> IS UNIQUE for the key properties of your labels. The constraint is backed by an index, so it has a huge impact on relationship creation.

    • When building relationships, use MATCH(...) rather than CREATE(...) to find the endpoint nodes. This avoids creating duplicates.
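
    The tips above can be sketched in Python as follows. This is only an illustration under stated assumptions: the `Person` label, the `KNOWS` type, the CSV file names and column names, and the batch size are all hypothetical, and using MERGE for the relationship itself is my own choice, not something stated above. The constraint syntax is the one quoted in the tip, which newer Neo4j versions have replaced. The driver import assumes the current package layout (`from neo4j import GraphDatabase`); older driver releases used a different module path.

```python
BATCH_SIZE = 10000  # rows per commit; an assumed value, tune it for your data

# Unique constraint on the key property, created before relationships are loaded.
CONSTRAINT = "CREATE CONSTRAINT ON (p:Person) ASSERT p.id IS UNIQUE"

# Nodes are created in batches of BATCH_SIZE rows.
LOAD_NODES = f"""
USING PERIODIC COMMIT {BATCH_SIZE}
LOAD CSV WITH HEADERS FROM 'file:///persons.csv' AS row
CREATE (:Person {{id: row.id, name: row.name}})
"""

# Endpoints are looked up with MATCH, so no new nodes are created here.
LOAD_RELATIONSHIPS = f"""
USING PERIODIC COMMIT {BATCH_SIZE}
LOAD CSV WITH HEADERS FROM 'file:///knows.csv' AS row
MATCH (a:Person {{id: row.from_id}})
MATCH (b:Person {{id: row.to_id}})
MERGE (a)-[:KNOWS]->(b)
"""

STATEMENTS = [CONSTRAINT, LOAD_NODES, LOAD_RELATIONSHIPS]


def run_import(uri, user, password, statements=STATEMENTS):
    """Run each statement in its own auto-commit transaction."""
    # Imported lazily so the statements can be inspected without the driver.
    from neo4j import GraphDatabase

    driver = GraphDatabase.driver(uri, auth=(user, password))
    try:
        with driver.session() as session:
            for statement in statements:
                # session.run() is auto-commit, which USING PERIODIC COMMIT requires.
                session.run(statement).consume()
    finally:
        driver.close()


# Example call (connection details are placeholders):
# run_import("bolt://localhost:7687", "neo4j", "<password>")
```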

    Regarding Neo4j performance:

    • First of all, read the official Neo4j page on performance tuning: https://neo4j.com/docs/operations-manual/current/performance/

    • Set a proper memory configuration for your Windows machine: if necessary, manually configure the dbms.memory.pagecache.size parameter (in the neo4j.conf file).

    • Remember: the Java Virtual Machine is not a black box; you can tune its performance specifically for your application (by editing the neo4j-community.vmoptions file). For example, you can set the maximum heap size for the JVM (the -Xmx parameter), and you can set the -XX:+UseG1GC parameter to use the G1 garbage collector (high performance, suggested by Oracle for production environments) (https://docs.oracle.com/cd/E40972_01/doc.70/e40973/cnf_jvmgc.htm#autoId0)

    Here are the custom lines from my neo4j.conf (just for reference; beware, this setup may be wrong for your application):

    dbms.memory.pagecache.size=3g
    dbms.jvm.additional=-XX:+UseG1GC
    dbms.jvm.additional=-XX:-OmitStackTraceInFastThrow
    dbms.jvm.additional=-XX:+AlwaysPreTouch
    dbms.jvm.additional=-XX:+UnlockExperimentalVMOptions
    dbms.jvm.additional=-XX:+TrustFinalNonStaticFields
    dbms.jvm.additional=-XX:+DisableExplicitGC
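
    Note that dbms.jvm.additional appears once per JVM flag. As a rough sketch (not a real neo4j.conf parser), the following Python snippet reads lines like the ones above, keeps every value of a repeated key, and flags JVM options that do not look like -X or -D options. The prefix check is only my own heuristic, not a rule taken from Neo4j.

```python
from collections import defaultdict

# The settings listed above, embedded as text for the example.
CONF_TEXT = """\
dbms.memory.pagecache.size=3g
dbms.jvm.additional=-XX:+UseG1GC
dbms.jvm.additional=-XX:-OmitStackTraceInFastThrow
dbms.jvm.additional=-XX:+AlwaysPreTouch
dbms.jvm.additional=-XX:+UnlockExperimentalVMOptions
dbms.jvm.additional=-XX:+TrustFinalNonStaticFields
dbms.jvm.additional=-XX:+DisableExplicitGC
"""


def parse_conf(text):
    """Collect key=value settings; repeated keys keep every value in order."""
    settings = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, _, value = line.partition("=")
        settings[key.strip()].append(value.strip())
    return dict(settings)


settings = parse_conf(CONF_TEXT)
jvm_flags = settings["dbms.jvm.additional"]

# Heuristic only: most JVM options start with -X (including -XX:) or -D.
unusual = [flag for flag in jvm_flags if not flag.startswith(("-X", "-D"))]

print("page cache:", settings["dbms.memory.pagecache.size"][0])
print("JVM flags:", len(jvm_flags), "unusual:", unusual)
```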
    

    And my neo4j-community.vmoptions custom lines (again, just for reference):

    -Xmx1024m
    -XX:+UseG1GC
    -XX:-OmitStackTraceInFastThrow
    -XX:+AlwaysPreTouch
    -XX:+UnlockExperimentalVMOptions
    -XX:+TrustFinalNonStaticFields
    -XX:+DisableExplicitGC
    

    My test machine is a weak notebook with a Core i3 (dual core), 8 GB of RAM, Windows 10 and Neo4j 3.2.1 Community Edition.

    I can import 7 million nodes in less than 3 minutes and 3.5 million relationships in less than 5 minutes (no recursive relationships).

    On a more capable machine, with a specifically crafted setup, Neo4j can do WAY better than this. Hope it helps.
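
    For a sense of scale, the figures above can be turned into rough throughput numbers and extrapolated to a billion items. This is a back-of-the-envelope sketch only: it treats "less than 3 minutes" and "less than 5 minutes" as exact durations and assumes the rate stays constant as the database grows, which is an assumption and not a measurement.

```python
def throughput(items, seconds):
    """Items processed per second."""
    return items / seconds


def hours_for(items, rate_per_second):
    """Hours needed to process `items` at a constant rate."""
    return items / rate_per_second / 3600


# Figures quoted above, with the time limits treated as exact durations.
node_rate = throughput(7_000_000, 3 * 60)
rel_rate = throughput(3_500_000, 5 * 60)

TARGET = 1_000_000_000  # one billion, to match the scale of the question

node_hours = hours_for(TARGET, node_rate)
rel_hours = hours_for(TARGET, rel_rate)

print(f"nodes:         {node_rate:,.0f}/s, about {node_hours:.1f} h per billion")
print(f"relationships: {rel_rate:,.0f}/s, about {rel_hours:.1f} h per billion")
```

    Under these assumptions, a billion nodes would take roughly 7 hours and a billion relationships roughly 24 hours on this notebook, which is why the offline import tool is worth considering at that scale.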
