Load a large CSV file into Neo4j


Question


I want to load a CSV file, rels.csv, containing relationships between Wikipedia categories (4 million relations). I tried modifying the settings file by changing the following parameter values:

dbms.memory.heap.initial_size=8G 
dbms.memory.heap.max_size=8G
dbms.memory.pagecache.size=9G

My query is as follows:

USING PERIODIC COMMIT 10000
LOAD CSV FROM 
"https://github.com/jbarrasa/datasets/blob/master/wikipedia/data/rels.csv?raw=true" AS row
    MATCH (from:Category { catId: row[0]})
    MATCH (to:Category { catId: row[1]})
    CREATE (from)-[:SUBCAT_OF]->(to)

Moreover, I created indexes on catId and catName. Despite all these optimizations, the query is still running (since yesterday).
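For reference, the index creation would look something like this (a minimal sketch, assuming the Neo4j 3.x Cypher syntax that was current when this question was asked; Neo4j 4.x and later use CREATE INDEX ... FOR (c:Category) ON (c.catId) instead):

// Index the property used by the MATCH clauses in the load query.
CREATE INDEX ON :Category(catId);
// Only needed if you also look nodes up by name.
CREATE INDEX ON :Category(catName);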

Can you tell me if there are more optimizations that should be done to load this CSV file?


Answer 1:


It's taking too much time. 4 million relationships should take a few minutes, if not seconds.

I just loaded all the data from the link you shared in 321 seconds (categories in 90 seconds, relationships in 231) with less than half of your memory settings, on my personal laptop.
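The node-loading step is not shown in the question; here is a minimal sketch of what it could look like for the cats.csv file in the same repository (the two-column id,name layout is an assumption, inferred from the catId and catName properties the question uses):

USING PERIODIC COMMIT 100000
LOAD CSV FROM
"https://github.com/jbarrasa/datasets/blob/master/wikipedia/data/cats.csv?raw=true" AS row
    // Assumed layout: row[0] = category id, row[1] = category name.
    CREATE (:Category { catId: row[0], catName: row[1] })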

dbms.memory.heap.initial_size=1G  
dbms.memory.heap.max_size=4G 
dbms.memory.pagecache.size=1512m

And this is not the limit; it can be improved further.

Slightly modified query: increased the PERIODIC COMMIT batch size 10× (from 10,000 to 100,000)

USING PERIODIC COMMIT 100000
LOAD CSV FROM 
"https://github.com/jbarrasa/datasets/blob/master/wikipedia/data/rels.csv?raw=true" AS row
    MATCH (from:Category { catId: row[0]})
    MATCH (to:Category { catId: row[1]})
    CREATE (from)-[:SUBCAT_OF]->(to)

Some suggestions:

  1. Create an index on the fields that are used to match nodes. (There is no need to index other fields while loading data; that can be done later, and it consumes unnecessary memory.)

  2. Don't set the max heap size to all of the system RAM. Set it to about 50% of RAM.

  3. Increase the batch size: increasing the heap (RAM) will not improve performance unless it is actually used. With a PERIODIC COMMIT batch size of 10,000, most of the heap stays free. I was able to load the data with a batch size of 100,000 on a 4G heap. You can set 200,000 or more; if that causes issues, decrease it.
  4. IMPORTANT: Make sure you restart Neo4j after changing or setting configurations (if not done already). You can verify the settings with the query below.
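To confirm the new memory settings are actually in effect after the restart, you can query the live configuration (dbms.listConfig is a built-in procedure in Neo4j 3.x/4.x):

// List the effective memory-related settings on the running instance.
CALL dbms.listConfig() YIELD name, value
WHERE name STARTS WITH 'dbms.memory'
RETURN name, value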

Don't forget to delete the previously loaded data before you run the LOAD CSV query again, as CREATE will produce duplicate relationships.
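A minimal sketch of that cleanup (for millions of relationships, a plain DELETE builds one large transaction; batching, e.g. via apoc.periodic.iterate if the APOC plugin is installed, is safer):

// Remove previously created SUBCAT_OF relationships before re-running the load.
MATCH (:Category)-[r:SUBCAT_OF]->(:Category)
DELETE r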

NOTE: I downloaded the files to my laptop and loaded them from there, so no download time is included.
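If you do the same, place the files in the Neo4j import directory and point LOAD CSV at a file:/// URL (a sketch; the filename assumes you kept rels.csv):

USING PERIODIC COMMIT 100000
LOAD CSV FROM "file:///rels.csv" AS row
    MATCH (from:Category { catId: row[0]})
    MATCH (to:Category { catId: row[1]})
    CREATE (from)-[:SUBCAT_OF]->(to)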



Source: https://stackoverflow.com/questions/56547442/load-a-large-csv-file-into-neo4j
