Gremlin correlated queries kill performance

回眸只為那壹抹淺笑 提交于 2019-12-08 14:00:13

问题


I understand that implementation specifics factor into this question, but I also realize that I may be doing something wrong here. If so, what could I do better? If Gremlin has some multi-resultset submit queries batch feature I don't know about, that would solve the problem. As in, hey Gremlin, run these three queries in parallel and give me their results.

Essentially, I need to know when a vertex has a certain edge and if it doesn't have that edge, I need to pull a blank. So...

g.V().as("v").coalesce(outE("someLabel").has("someProperty","someValue"),constant()).as("e").select("v","e")

That query is 10x more expensive than simply getting the edges using:

g.V().outE("someLabel").has("someProperty","someValue")

So if I want to get a set of vertices with their edges or blank placeholders, I have two options: Make two discrete queries and "join" the data in the API or make one very expensive query.

I'm working from the assumption that in Gremlin, we "do it in one trip" and that may in fact be quite wrong. That said, I also know that pulling back chunks of data and effectively doing joins in the API is bad practice because it breaks the encapsulation principal. It also adds roundtrip overhead.


回答1:


OK, so I found a solution that is ridiculous but fast. It involves fudging the traversal so let me apologize up front if there's a better way...

g.inject(true).
union(
  __.V().not(outE("someLabel")).constant().as("ridiculous"),
  __.V().outE("someLabel").as("ridiculous")
).
select("ridiculous")

In essence, I have to write the query twice. Once for the traversal with the edge I want and once more for the traversal where the edge is missing. So, if I have n present / not present checks I'm going to need 2 ^ n copies of the query each with its own combination of checks so that I get the most optimal performance. Unfortunately, taking that approach runs the risk of a stack overflow not to mention making code impossible to manage reliably.




回答2:


Your original query returned vertex-edge pairs, where as your answer returns only edges.

You could just run g.E().hasLabel("somelabel") to get the same result.

Probably a faster alternative to your original query might be:

g.E().hasLabel("somelabel").as("e").outV().as("v").select("v","e")

Or

g.V().as("v").outE("somelabel").as("e").select("v","e")



回答3:


If Gremlin has some multi-resultset submit queries batch feature I don't know about, that would solve the problem

Gremlin/TinkerPop do not have such functionality built in. There is at least one graph that does have some form of Gremlin batching - DataStax Graph...not sure about others.

I'm also not sure I really have any answer that you might find useful, but while I wouldn't expect a 10x difference in performance between those two traversals, I would expect the first to perform worse on most graph databases. Basically, the use of named steps with as() enables path calculation requirements on the traversal which increases costs. When optimizing Gremlin, one of my earliest steps is to try to look for ways to factor out anything that might do that.

This question seems related to your other question on Jagged Result Array and but I'm having trouble maintaining the context from one question into the other to understand how to expound further.



来源:https://stackoverflow.com/questions/58549057/gremlin-correlated-queries-kill-performance

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!