Question
I want to insert a few billion nodes and relationships into Neo4j. "LOAD CSV" gets cancelled after 30 min by the browser (Chrome) because the working memory is overloaded, even though I have 16GB RAM.
Large datasets can apparently be imported into Neo4j using the Batch Importer (Documentation & Download, Explanation for Linux).
To simply use it (no source/git/maven required):
1. download 2.2 zip
2. unzip
3. run import.sh test.db nodes.csv rels.csv (on Windows: import.bat)
4. after the import, point your /path/to/neo4j/conf/neo4j-server.properties to this test.db directory, or copy the data over to your server: cp -r test.db/* /path/to/neo4j/data/graph.db/
You provide one tab separated csv file for nodes and one for relationships (optionally more for indexes)
I am struggling to use the plugin on Windows. In the Linux video by Rik Van Bruggen (link above) he mentions "installation of the batch importer".
I unzipped the file "download 2.2 zip". I have my CSVs in another folder. How do I use the "import.bat" command mentioned in the documentation on Windows? In cmd the command can't be found...
Answer 1:
Before you use the tool for gigantic datasets, I can suggest a few things I just learned while importing millions of nodes in a few minutes (Neo4j Community Edition for Windows).
Regarding Neo4j import tips:
- Don't use the web interface to import such big datasets; memory overload is inevitable.
- Instead, use a programming language to interact with Neo4j (I recently used the official Python module and it's simple to learn, but you can do the same with good old Java).
- Before using LOAD CSV, remember to write the USING PERIODIC COMMIT instruction so that a big batch of rows is committed on each iteration.
- Before importing relationships from CSV, remember to use CREATE CONSTRAINT ON <...> ASSERT <...> IS UNIQUE for the key properties of your labels. It will have a huge impact on relationship creation (the constraint also creates an index, which makes the node lookups fast).
- Use MATCH (...), not CREATE (...), for the nodes in the relationship procedure. It avoids duplicates.
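The LOAD CSV tips above can be sketched with the official Python driver roughly as follows. This is a minimal sketch, not a tested import script: the Person label, the id and name properties, the KNOWS relationship type, the from_id/to_id column names, the file names and the batch size are all placeholders assumed for illustration, and the constraint uses the Neo4j 3.x syntax quoted above.

```python
# Hypothetical schema: (:Person {id, name}) nodes and [:KNOWS] relationships.
# Labels, column names, file names and batch size are assumptions for illustration.

CONSTRAINT = "CREATE CONSTRAINT ON (p:Person) ASSERT p.id IS UNIQUE"


def load_nodes_query(csv_url, batch_size=10000):
    """Build a LOAD CSV statement that commits every `batch_size` rows."""
    return (
        f"USING PERIODIC COMMIT {batch_size} "
        f"LOAD CSV WITH HEADERS FROM '{csv_url}' AS row "
        "CREATE (:Person {id: row.id, name: row.name})"
    )


def load_rels_query(csv_url, batch_size=10000):
    """MATCH the existing endpoint nodes, then CREATE only the relationship."""
    return (
        f"USING PERIODIC COMMIT {batch_size} "
        f"LOAD CSV WITH HEADERS FROM '{csv_url}' AS row "
        "MATCH (a:Person {id: row.from_id}), (b:Person {id: row.to_id}) "
        "CREATE (a)-[:KNOWS]->(b)"
    )


def run_import(uri, user, password):
    """Run the statements in order; needs the `neo4j` package and a live server."""
    from neo4j import GraphDatabase  # imported here so the builders work without it

    driver = GraphDatabase.driver(uri, auth=(user, password))
    try:
        with driver.session() as session:
            # session.run() uses auto-commit transactions, which
            # USING PERIODIC COMMIT requires.
            session.run(CONSTRAINT).consume()
            session.run(load_nodes_query("file:///nodes.csv")).consume()
            session.run(load_rels_query("file:///rels.csv")).consume()
    finally:
        driver.close()
```

Calling run_import with your own bolt URI and credentials would create the constraint first, then the nodes, then the relationships, so the MATCH lookups in the last step can use the index behind the constraint.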
Regarding Neo4j performance:
- First of all, read the official Neo4j page on performance tuning: https://neo4j.com/docs/operations-manual/current/performance/
- Set a proper memory configuration for your Windows machine: configure the dbms.memory.pagecache.size parameter manually (in the neo4j.conf file), if necessary.
- Remember: the Java Virtual Machine is not a black box; you can tune its performance specifically for your application (by editing the neo4j-community.vmoptions file). For example, you can set the maximum memory usage of the JVM (-Xmx parameter), and you can set the -XX:+UseG1GC parameter to use the G1 Garbage Collector (high performance, suggested by Oracle for production environments): https://docs.oracle.com/cd/E40972_01/doc.70/e40973/cnf_jvmgc.htm#autoId0
I'll post my neo4j.conf custom lines used for my configuration (just for reference, it may be a wrong setup for your application, beware):
dbms.memory.pagecache.size=3g
dbms.jvm.additional=-XX:+UseG1GC
dbms.jvm.additional=-XX:-OmitStackTraceInFastThrow
dbms.jvm.additional=-XX:+AlwaysPreTouch
dbms.jvm.additional=-XX:+UnlockExperimentalVMOptions
dbms.jvm.additional=-XX:+TrustFinalNonStaticFields
dbms.jvm.additional=-XX:+DisableExplicitGC
And my neo4j-community.vmoptions custom lines (again, just for reference):
-Xmx1024m
-XX:+UseG1GC
-XX:-OmitStackTraceInFastThrow
-XX:+AlwaysPreTouch
-XX:+UnlockExperimentalVMOptions
-XX:+TrustFinalNonStaticFields
-XX:+DisableExplicitGC
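As a rough sanity check on settings like these, the heap and the page cache together should still leave some RAM for the operating system. The sketch below is only that arithmetic, using the 8 GB machine, the 1024 MB heap and the 3 GB page cache from this answer; the 2 GB operating-system reserve is an assumed figure, not one taken from the configuration, and the function name is hypothetical.

```python
def page_cache_budget_mb(total_ram_mb, heap_mb, os_reserve_mb):
    """Return the RAM (in MB) left for the page cache after heap and OS reserve."""
    budget = total_ram_mb - heap_mb - os_reserve_mb
    if budget <= 0:
        raise ValueError("heap and OS reserve already exceed the available RAM")
    return budget


# Values from this answer: 8 GB machine, -Xmx1024m, page cache 3g.
# The 2 GB OS reserve is an assumption for illustration.
budget = page_cache_budget_mb(total_ram_mb=8 * 1024, heap_mb=1024, os_reserve_mb=2 * 1024)
page_cache_mb = 3 * 1024

print(f"page cache budget: {budget} MB, configured: {page_cache_mb} MB")
print("fits" if page_cache_mb <= budget else "too large")
```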
My test machine is a weak notebook equipped with a Core i3 (dual core), 8GB of RAM, Windows 10 and Neo4j 3.2.1 Community Edition.
I can import 7 million nodes in less than 3 minutes and 3.5 million relationships in less than 5 minutes (no recursive relationships).
On a more capable machine, with a specifically crafted setup, Neo4j can do WAY better than this. Hope it helps.
Answer 2:
- If you use LOAD CSV with USING PERIODIC COMMIT, you should not run into any memory trouble. A few billion nodes (so delightfully vague :-) may take a while to load though.
- https://neo4j.com/docs/operations-manual/current/tutorial/import-tool/ explains how to load the database in offline mode. You can use either neo4j-import or neo4j-admin import (both are command-line tools, the second one is replacing the first one) to do that. No plugins are needed at all. Make sure you are using a Windows zip installation (CE or EE); the exe installation (CE only) may not contain these tools.
- If you're doing any form of massive updates, the browser is never a good choice. Seriously. It's meant to do visualizations, and if you give it any chance in your syntax it will try to do so. Is that really what you want for a long-running batch update? Use cypher-shell (command line) instead. Many, many issues here on Stack Overflow are not actually Neo4j issues but just people overloading the DOM structure of the browser (so it's actually a Firefox or Chrome issue).
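For the offline route in the second point, a minimal sketch of preparing input files and building the command line might look like this. It assumes the import tool's CSV header convention (:ID, :LABEL, :START_ID, :END_ID, :TYPE) and the neo4j-admin import syntax of Neo4j 3.x; the file names, the sample rows and the Person/KNOWS names are made up for illustration, and the path to neo4j-admin.bat is an assumption that depends on where the zip was unpacked.

```python
import csv
import subprocess
from pathlib import Path


def write_csv(path, header, rows):
    """Write a comma-separated file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)


def build_import_command(neo4j_admin, nodes_csv, rels_csv):
    """Build the argument list for an offline bulk import (Neo4j 3.x syntax)."""
    return [
        str(neo4j_admin),
        "import",
        f"--nodes={nodes_csv}",
        f"--relationships={rels_csv}",
    ]


def run(command):
    """Execute the import; requires a Neo4j installation at the given path."""
    subprocess.run(command, check=True)


# Made-up sample data; real files would hold the full dataset.
write_csv("nodes.csv", ["personId:ID", "name", ":LABEL"],
          [["1", "Alice", "Person"], ["2", "Bob", "Person"]])
write_csv("rels.csv", [":START_ID", ":END_ID", ":TYPE"],
          [["1", "2", "KNOWS"]])

# Assumed location of the tool inside an unzipped Windows installation.
command = build_import_command(Path("bin") / "neo4j-admin.bat", "nodes.csv", "rels.csv")
print(" ".join(command))
# run(command)  # uncomment on a machine where Neo4j is installed
```

The printed command can also be typed directly into cmd from the Neo4j installation directory, which avoids both the browser and the Python driver.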
Hope this helps.
Regards, Tom
Source: https://stackoverflow.com/questions/45770769/import-billions-of-nodes-and-relationships-to-neo4j-using-batch-import-on-window