Cypher MATCH query speed

非 Y 不嫁゛ 提交于 2019-12-19 10:23:59

问题


I have Neo4j installed on a windows machine with 12 processors and 64GB ram. I did not change any of the memory settings that Neo4j allows for.

My database has 3.8m nodes, 210,000 of which are labeled as Geotagged and a total of 650,000 relationships. I am trying to run the following query and I am wondering if this is a really intensive query that will likely take quite a while.

Messages.csv is my relationship file. The relationships have already been created, but as I could not figure out how to combine the relationship creation with the below Distance generation, I am loading and running through the relationship file twice.

USING PERIODIC COMMIT 15000
LOAD CSV WITH HEADERS FROM "file:d:/messages.csv" AS line
MATCH (a:Geotagged { username: line.sender }) - [r:MSGED] -> (b:Geotagged { username: line.recipient })
SET r.Distance = (2 * 6371 * asin(sqrt(haversin(radians(toFloat(b.statusLat) - toFloat(a.statusLat))) + cos(radians(toFloat(b.statusLat))) * cos(radians(toFloat(a.statusLat))) * haversin(radians(toFloat(b.statusLon) - toFloat(a.statusLon))))));

The initial relationship generation takes about 3-5 minutes. I let the above run for over an hour and it still was not complete. I ran a similar algorithm (though it had a few more trig calls in it) on the same initial db and let it run for over 18 hours and still had not completed.

My question: Is this a very intensive query? Am I not giving it enough time? And more importantly, is there a way I can optimize this?

I tried adding "WHERE NOT HAS(r.Distance)" to exclude node pairs that the algorithm has already set the Distance on, though I am unsure if the MATCH is a one-time match or if it will MATCH for each line in the CSV file?

Any thoughts on this would really be appreciated.


回答1:


One way that I would start to debug is to put a limit on it using WITH:

USING PERIODIC COMMIT 15000
LOAD CSV WITH HEADERS FROM "file:d:/messages.csv" AS line
WITH line LIMIT 100
MATCH (a:Geotagged { username: line.sender }) - [r:MSGED] -> (b:Geotagged { username: line.recipient })
SET r.Distance = (2 * 6371 * asin(sqrt(haversin(radians(toFloat(b.statusLat) - toFloat(a.statusLat))) + cos(radians(toFloat(b.statusLat))) * cos(radians(toFloat(a.statusLat))) * haversin(radians(toFloat(b.statusLon) - toFloat(a.statusLon))))));

With that you can change the LIMIT number to see how the performance degrades as the limit increases.

Also, is the username property indexes for the Geotagged label? If not it definitely should be, like this:

CREATE INDEX ON :Geotagged(username)

If it's unique and you want the database to enforce that:

CREATE CONSTRAINT ON (g:Geotagged) ASSERT g.username IS UNIQUE



回答2:


This is additional to Brian's reply:

Your statement's query plan shows EAGER, to verify run

EXPLAIN explain LOAD CSV WITH HEADERS FROM "file:d:/messages.csv" AS line
WITH line LIMIT 100
MATCH (a:Geotagged { username: line.sender }) - [r:MSGED] -> (b:Geotagged { username: line.recipient })
SET r.Distance = (2 * 6371 *asin(sqrt(haversin(radians(toFloat(b.statusLat) - toFloat(a.statusLat))) + cos(radians(toFloat(b.statusLat))) * cos(radians(toFloat(a.statusLat))) * haversin(radians(toFloat(b.statusLon) - toFloat(a.statusLon))))));

Eagerness in LOAD CSV is pretty bad, see the these blog posts why:

  • http://www.markhneedham.com/blog/2014/10/23/neo4j-cypher-avoiding-the-eager/
  • http://jexp.de/blog/2014/10/load-cvs-with-success/

Following Mark's suggested and replacing the MATCH/SET with a MERGE ON MATCH SET we can refactor that into:

explain LOAD CSV WITH HEADERS FROM "file:d:/messages.csv" AS line
WITH line LIMIT 100
MATCH (a:Geotagged { username: line.sender }), (b:Geotagged { username: line.recipient })
MERGE (a)-[r:MSGED]->(b)
ON MATCH SET r.Distance = (2 * 6371 * asin(sqrt(haversin(radians(toFloat(b.statusLat) - toFloat(a.statusLat))) + cos(radians(toFloat(b.statusLat))) * cos(radians(toFloat(a.statusLat))) * haversin(radians(toFloat(b.statusLon) - toFloat(a.statusLon))))));

And eager has vanished.



来源:https://stackoverflow.com/questions/29521401/cypher-match-query-speed

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!