Neo4j batch importer is slow with big IDs


Question


I want to import CSV files with about 40 million lines into Neo4j. For this I am trying to use the "batch-importer" from https://github.com/jexp/batch-import. Maybe it is a problem that I provide my own IDs. This is the example:

nodes.csv:

i:id    l:label
315041100    Person
201215100    Person
315041200    Person

rels.csv:

start    end    type    relart
315041100    201215100    HAS_RELATION    30006
315041200    315041100    HAS_RELATION    30006

The content of batch.properties:

use_memory_mapped_buffers=true
neostore.nodestore.db.mapped_memory=1000M
neostore.relationshipstore.db.mapped_memory=5000M
neostore.propertystore.db.mapped_memory=4G
neostore.propertystore.db.strings.mapped_memory=2000M
neostore.propertystore.db.arrays.mapped_memory=1000M
neostore.propertystore.db.index.keys.mapped_memory=1500M
neostore.propertystore.db.index.mapped_memory=1500M
batch_import.node_index.node_auto_index=exact


./import.sh graph.db nodes.csv rels.csv

The import runs without errors, but it takes about 60 seconds!

Importing 3 Nodes took 0 seconds 
Importing 2 Relationships took 0 seconds 
Total import time: 54 seconds 

When I use smaller IDs - for example 3150411 instead of 315041100 - it takes just 1 second!

Importing 3 Nodes took 0 seconds 
Importing 2 Relationships took 0 seconds 
Total import time: 1 seconds 

Actually I would like to use even bigger IDs with 10 digits. I don't know what I'm doing wrong. Can anyone spot an error?

  • JDK 1.7
  • batch-importer 2.1.3 (with Neo4j 2.1.3)
  • OS: Ubuntu 14.04
  • Hardware: 8-core Intel CPU, 16 GB RAM

Answer 1:


I think the problem is that the batch importer interprets those IDs as actual physical record IDs on disk, and so the time is spent in the file system, inflating the store files up to the size where they can fit those high IDs.
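As a rough back-of-the-envelope check (the ~15-byte node record size is an assumption based on the Neo4j 2.x store format, not something stated in the question), the highest ID in the file dictates how large the node store has to grow:

315,041,200 records × 15 bytes/record ≈ 4.7 GB   (big IDs)
  3,150,411 records × 15 bytes/record ≈ 47 MB    (small IDs)

Spending ~54 seconds writing gigabytes of mostly empty records would be consistent with what you are seeing.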

The IDs that you're providing are presumably intended to be "internal" to the batch import, right? Although I'm not sure how to tell the batch importer that this is the case.

@michael-hunger any input there?




Answer 2:


The problem is that those IDs are internal to Neo4j, where they represent disk record IDs. If you provide high values there, Neo4j will create a lot of empty records until it reaches your IDs.

So either you create your node IDs starting from 0 and store your actual ID as a normal node property:

i:id    id:long    l:label
0    315041100    Person
1    201215100    Person
2    315041200    Person

start:id    end:id    type    relart
0    1    HAS_RELATION    30006
2    0    HAS_RELATION    30006
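With the original ID stored as a regular property, nodes can later be found again through Cypher. A sketch of that lookup (the schema index is my addition, not part of the answer; it is advisable for 40 million nodes):

CREATE INDEX ON :Person(id);

MATCH (p:Person {id: 315041100}) RETURN p;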

Or you don't provide node IDs at all and instead configure and use an index, looking nodes up via their "business ID" value:

id:long:people    l:label
315041100    Person
201215100    Person
315041200    Person

id:long:people    id:long:people    type    relart
315041100    201215100    HAS_RELATION    30006
315041200    315041100    HAS_RELATION    30006
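For the index variant, the index also has to be declared in batch.properties. Following the same batch_import.node_index pattern as the node_auto_index line in the question (the index name people is taken from the id:long:people header):

batch_import.node_index.people=exact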

HTH Michael

Alternatively, you can also just write a small Java or Groovy program to import your data if handling those IDs with the batch-importer is too tricky. See: http://jexp.de/blog/2014/10/flexible-neo4j-batch-import-with-groovy/
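For completeness, here is a minimal sketch of that approach in Java, matching the JDK 1.7 setup from the question. It is an illustration, not code from the answer or the blog post: the file names, the whitespace splitting, and the in-memory HashMap translating business IDs to inserter-assigned node IDs are assumptions (for 40 million nodes a primitive-long map library would be more memory-efficient).

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashMap;
import java.util.Map;

import org.neo4j.graphdb.DynamicLabel;
import org.neo4j.graphdb.DynamicRelationshipType;
import org.neo4j.unsafe.batchinsert.BatchInserter;
import org.neo4j.unsafe.batchinsert.BatchInserters;

public class CsvImport {
    public static void main(String[] args) throws Exception {
        BatchInserter inserter = BatchInserters.inserter("graph.db");
        // Maps the 9/10-digit business ID to the compact node ID that
        // the inserter assigns, so the store files stay small.
        Map<Long, Long> idMap = new HashMap<>();

        BufferedReader nodes = new BufferedReader(new FileReader("nodes.csv"));
        nodes.readLine(); // skip the "i:id l:label" header
        String line;
        while ((line = nodes.readLine()) != null) {
            String[] cols = line.split("\\s+");
            long businessId = Long.parseLong(cols[0]);
            Map<String, Object> props = new HashMap<>();
            props.put("id", businessId); // keep the big ID as a property
            long nodeId = inserter.createNode(props, DynamicLabel.label(cols[1]));
            idMap.put(businessId, nodeId);
        }
        nodes.close();

        BufferedReader rels = new BufferedReader(new FileReader("rels.csv"));
        rels.readLine(); // skip the "start end type relart" header
        while ((line = rels.readLine()) != null) {
            String[] cols = line.split("\\s+");
            Map<String, Object> props = new HashMap<>();
            props.put("relart", Long.parseLong(cols[3]));
            inserter.createRelationship(idMap.get(Long.parseLong(cols[0])),
                    idMap.get(Long.parseLong(cols[1])),
                    DynamicRelationshipType.withName(cols[2]), props);
        }
        rels.close();

        inserter.shutdown(); // flushes the store files to disk
    }
}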



Source: https://stackoverflow.com/questions/26627394/neo4j-batchimporter-is-slow-with-big-ids
