Can't reproduce/verify the performance claims in the "Graph Databases" and "Neo4j in Action" books

Submitted by 江枫思渺然 on 2019-12-21 03:09:25

Question


UPDATE: I have put up a follow-up question, neo4j performance compared to mysql (how can it be improved?), with updated scripts and a clearer setup. Please continue there. /UPDATE

I have some problems verifying the performance claims made in the "Graph Databases" book (page 20) and in "Neo4j in Action" (chapter 1).

To verify these claims, I created a sample dataset of 100,000 'person' entries with 50 'friends' each, and tried to query for e.g. friends 4 hops away. I used the very same dataset in MySQL. For friends of friends over 4 hops, MySQL returns in 0.93 s, while neo4j needs 65-75 s (on repeated calls).
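A quick back-of-the-envelope check (my own arithmetic, assuming a uniform fan-out of 50 and counting non-distinct paths) shows how fast the candidate path count grows with each hop:

```python
# With 50 friends per person, the number of (not necessarily distinct)
# k-hop paths starting from a single node grows as 50**k.
fanout = 50
for hops in range(1, 5):
    print(hops, fanout ** hops)
# 3 hops already yields 125,000 candidate paths; 4 hops yields 6,250,000.
```

Both databases therefore have to touch millions of relationship entries per 4-hop query, which is exactly the workload the books claim graph databases handle better.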

How can I improve this miserable outcome, and verify the claims made in the books?

A bit more detail:

I run the whole setup on an i5-3570K with 16 GB RAM, using Ubuntu 12.04 64-bit, Java version "1.7.0_25", MySQL 5.5.31 and neo4j-community-2.0.0-M03 (I get a similar outcome with 1.9).

All code/sample data can be found on https://github.com/jhb/neo4j-experiements/ (to be used with 2.0.0). The resulting sample data in different formats can be found on https://github.com/jhb/neo4j-testdata.

To use the scripts you need a Python installation with the mysql-python, requests and simplejson packages installed.

  • the dataset is created with friendsdata.py and stored to friends.pickle
  • friends.pickle gets imported to neo4j using import_friends_neo4j.py
  • friends.pickle gets imported to mysql using import_friends_mysql.py
  • I added indexes on t_user_friend.* in MySQL
  • I ran "create index on :node(noscenda_name)" in neo4j

To make life a bit easier, the friends.*.bz2 files contain SQL and Cypher statements to create those datasets in MySQL and neo4j 2.0 M3.
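The actual generator is friendsdata.py in the repo; purely as an illustration (the real on-disk pickle format may differ), a minimal sketch of such a generator could look like this, assuming the pickle holds a dict mapping each person name to a list of friend names:

```python
import pickle
import random

# Hypothetical sketch of a friends-dataset generator; the repo's
# friendsdata.py is authoritative and may use a different structure.
NUM_PEOPLE = 1000        # the real dataset uses 100000
NUM_FRIENDS = 50

names = ['person%d' % i for i in range(NUM_PEOPLE)]
# Pick 50 random friends per person (self-friendship not excluded here).
friends = {name: random.sample(names, NUM_FRIENDS) for name in names}

with open('friends.pickle', 'wb') as f:
    pickle.dump(friends, f)
```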

MySQL performance

I first warm MySQL up by querying:

select count(distinct name) from t_user;
select count(distinct name) from t_user;

Then, for the real measurement, I run

python query_friends_mysql.py 4 10

This creates the following SQL statement (with varying t_user.name values):

select 
    count(*)
from
    t_user,
    t_user_friend as uf1, 
    t_user_friend as uf2, 
    t_user_friend as uf3, 
    t_user_friend as uf4
where
    t_user.name='person8601' and 
    t_user.id = uf1.user_1 and
    uf1.user_2 = uf2.user_1 and
    uf2.user_2 = uf3.user_1 and
    uf3.user_2 = uf4.user_1;

and repeats this 4-hop query 10 times. Each query takes around 0.95 s. MySQL is configured with a 4 GB key_buffer.
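The four t_user_friend self-joins simply chain the friend relation four times; on a toy graph the same count can be reproduced in a few lines of Python (a sketch of the join logic, not the benchmark script):

```python
# Count 4-hop paths by chaining an adjacency mapping, mirroring the
# uf1..uf4 self-joins: each step expands the frontier one hop further.
friends = {
    'a': ['b', 'c'],
    'b': ['c'],
    'c': ['a'],
}

def count_paths(adj, start, hops):
    frontier = {start: 1}          # node -> number of paths reaching it
    for _ in range(hops):
        nxt = {}
        for node, n in frontier.items():
            for f in adj.get(node, []):
                nxt[f] = nxt.get(f, 0) + n
        frontier = nxt
    return sum(frontier.values())

print(count_paths(friends, 'a', 4))  # → 4
```

The SQL count(*) over the chained joins is exactly this path count, which is why it explodes with the fan-out.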

neo4j performance testing

I have modified neo4j.properties:

neostore.nodestore.db.mapped_memory=25M
neostore.relationshipstore.db.mapped_memory=250M

and the neo4j-wrapper.conf:

wrapper.java.initmemory=2048
wrapper.java.maxmemory=8192

To warm up neo4j I do

start n=node(*) return count(n.noscenda_name);
start r=relationship(*) return count(r);

Then I start using the transactional http endpoint (but I get the same results using the neo4j-shell).

Still as part of the warm-up, I run

./bin/python query_friends_neo4j.py 3 10

This creates a query of the form (with varying person ids):

{"statement": "match n:node-[r*3..3]->m:node where n.noscenda_name={target} return count(r);", "parameters": {"target": "person3089"}}

After the 7th call or so, each call takes around 0.7-0.8 s.

Now for the real thing (4 hops) I do

./bin/python query_friends_neo4j.py 4 10

creating

{"statement": "match n:node-[r*4..4]->m:node where n.noscenda_name={target} return count(r);", "parameters": {"target": "person3089"}}

and each call takes between 65 and 75 secs.

Open questions/thoughts

I'd really like to see the claims in the books be reproducible and correct, and neo4j faster than MySQL instead of orders of magnitude slower.

But I don't know what I am doing wrong... :-(

So, my big hopes are:

  • I didn't do the memory settings for neo4j correctly
  • The query I use for neo4j is completely wrong

Any suggestions to get neo4j up to speed are highly welcome.

Thanks a lot,

Joerg


Answer 1:


2.0 has not been performance-optimized at all, so you should use 1.9.2 for comparison. (If you use 2.0: did you create an index for n.noscenda_name?)

You can check the query plan by prefixing the query with profile (profile start ...).

With 1.9 please use a manual index or node_auto_index for noscenda_name.
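For the auto-index route, the relevant switches live in conf/neo4j.properties (property names as used in the 1.9 series; verify against your version):

```
node_auto_indexing=true
node_keys_indexable=noscenda_name
```

With these set, every node created afterwards gets its noscenda_name value added to node_auto_index, which the start clause below can then look up exactly.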

Can you try these queries:

start n=node:node_auto_index(noscenda_name={target})
match n-->()-->()-->m
return count(*);

Fulltext indexes are also more expensive than exact indexes, so keep the exact auto-index for noscenda_name.

I can't get your importer to run; it fails at some point. Perhaps you can share the finished neo4j database:

python importer.py
reading rels
reading nodes
delete old
Traceback (most recent call last):
  File "importer.py", line 9, in <module>
    g.query('match n-[r]->m delete r;')
  File "/Users/mh/java/neo/neo4j-experiements/neo4jconnector.py", line 99, in query
    return self.call(payload)
  File "/Users/mh/java/neo/neo4j-experiements/neo4jconnector.py", line 71, in call
    self.transactionurl = result.headers['location']
  File "/Library/Python/2.7/site-packages/requests-1.2.3-py2.7.egg/requests/structures.py", line 77, in __getitem__
    return self._store[key.lower()][1]
KeyError: 'location'



Answer 2:


Just to add to what Michael said: in the "Graph Databases" book I believe the authors are referring to a comparison that was done in the "Neo4j in Action" book; it's described in that book's free first chapter.

At the top of page 7 they explain that they were using the Traversal API rather than Cypher.

I think you'll struggle to get Cypher near that level of performance at the moment, so if you want to run those types of queries you'll want to use the Traversal API directly, and then perhaps wrap it in an unmanaged extension.



Source: https://stackoverflow.com/questions/17773644/cant-reproduce-verify-the-performance-claims-in-graph-databases-and-neo4j-in-ac
