Does Blazegraph support range query optimization?

Posted by 谁说我不能喝 on 2021-01-29 06:41:01

Question


I am experimenting with Blazegraph. I insert triples representing 'events'; each event consists of 3 triples and looks like this:

<%event-iri%> <http://predicates/timestamp> '2020-01-02T03:04:05.000Z'^^xsd:dateTime .
<%event-iri%> <http://predicates/a> %RANDOM_UUID% .
<%event-iri%> <http://predicates/b> %RANDOM_UUID% .

Timestamps represent consecutive moments in time: each event is 1 minute later than the previous one.

I ran two sets of tests: one with 1 million events (3 million triples) and one with 3 million events (9 million triples).
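For anyone who wants to reproduce the setup, here is a minimal sketch of how such a dataset could be generated as N-Triples. The event IRI scheme (`http://events/<n>`), the start time, and the choice to write the random UUIDs as plain string literals are assumptions; the question does not specify them.

```python
import uuid
from datetime import datetime, timedelta, timezone

XSD_DATETIME = "http://www.w3.org/2001/XMLSchema#dateTime"


def event_triples(index, start):
    """Return the three N-Triples lines for the event at position `index`."""
    # Hypothetical IRI scheme; the question does not say how event IRIs are formed.
    event = f"<http://events/{index}>"
    # Each event is exactly one minute later than the previous one.
    moment = start + timedelta(minutes=index)
    stamp = moment.strftime("%Y-%m-%dT%H:%M:%S.000Z")
    return [
        f'{event} <http://predicates/timestamp> "{stamp}"^^<{XSD_DATETIME}> .',
        # Assumption: UUIDs are stored as plain string literals.
        f'{event} <http://predicates/a> "{uuid.uuid4()}" .',
        f'{event} <http://predicates/b> "{uuid.uuid4()}" .',
    ]


def generate(count, start=datetime(2020, 1, 2, 3, 4, 5, tzinfo=timezone.utc)):
    """Yield N-Triples lines for `count` consecutive events."""
    for i in range(count):
        yield from event_triples(i, start)
```

Calling `generate(1_000_000)` would yield the 3 million lines of the smaller dataset; the lines can be written to a file and bulk-loaded.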

I ran queries like the following:

select ?event ?a ?v 
where {
  ?event <http://predicates/timestamp> ?timestamp .
  filter (?timestamp >= '2020-01-02T03:04:05.000Z'^^xsd:dateTime && ?timestamp < '2020-01-02T03:05:05.000Z'^^xsd:dateTime)
  ?event ?a ?v .
}

I started with queries returning 1000 events (3000 triples) and then went down to queries that match only 1 event (and return 3 triples), to make sure that the size of the result set does not significantly influence the performance of the range query itself.

I also tried a hint found at https://sourceforge.net/p/bigdata/discussion/676946/thread/2cf9a1e8/?limit=25 that tells Blazegraph to use range query optimization, adding the following line right after the filter clause:

hint:Prior hint:rangeSafe "true" .

Also, it was mentioned that range queries work for some types but not for others (for ints they worked for johpfe), so I ran another set of tests with timestamps represented as ints (Unix timestamps):

<%event-iri%> <http://predicates/timestamp> 1606528746 .
<%event-iri%> <http://predicates/a> %RANDOM_UUID% .
<%event-iri%> <http://predicates/b> %RANDOM_UUID% .
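As an illustration of the integer variant, the following sketch converts a timezone-aware date-time to whole Unix seconds and formats the timestamp triple. The function names and the event IRI passed in are hypothetical.

```python
from datetime import datetime, timezone


def unix_seconds(moment):
    """Convert a timezone-aware datetime to whole Unix seconds."""
    return int(moment.timestamp())


def timestamp_triple(event_iri, moment):
    """Format the timestamp triple with a bare integer object."""
    return f"<{event_iri}> <http://predicates/timestamp> {unix_seconds(moment)} ."


# Example: the value used in the triples above corresponds to this instant.
example = datetime(2020, 11, 28, 1, 59, 6, tzinfo=timezone.utc)
print(timestamp_triple("http://events/1", example))
```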

The final query I tried was

select ?event ?a ?v 
where {
  ?event <http://predicates/timestamp> ?timestamp .
  filter (?timestamp >= 1606528746 && ?timestamp < 1606528806)
  hint:Prior hint:rangeSafe "true" .
  ?event ?a ?v .
}

Whatever I try, I get the same results: on the smaller dataset (1 million timestamps/ints) queries take 1 second, sometimes more, but never less; on the bigger dataset (3 million timestamps/ints) queries take at least 3 seconds.

The difference is 3x, which matches the 3x increase in data volume exactly. Query time growing linearly with dataset size suggests a full scan, so it looks like the range optimization is not being applied.

I also compared against MongoDB. With an index on the 'timestamp' field, it always executes an analogous query in 30-50 ms, regardless of data size.

What am I doing wrong? Is there a way to make Blazegraph apply the optimization here?

PS. I also tried putting the hint right after the triple pattern rather than after the filter statement, as per https://github.com/blazegraph/database/wiki/QueryHints, which says the following about the rangeSafe hint:

Declare that the data touched by the query for a specific triple pattern is strongly typed, thus allowing a range filter to be pushed down onto an index.

So the query became

select ?event ?a ?v 
where {
  ?event <http://predicates/timestamp> ?timestamp .
  hint:Prior hint:rangeSafe "true" .
  filter (?timestamp >= 1606528746 && ?timestamp < 1606528806)
  ?event ?a ?v .
}

But this query finds nothing, so the hint just breaks it.


Answer 1:


Here are the queries where the optimization does work.

This one is for the case of integers:

select ?event ?a ?v 
where {
  ?event <http://predicates/timestamp> ?timestamp .
  hint:Prior hint:rangeSafe "true" .
  filter (?timestamp >= "1606528746"^^xsd:int && ?timestamp < "1606528806"^^xsd:int)
  ?event ?a ?v .
}

And this one is for the case when timestamps are date-times:

select ?event ?a ?v 
where {
  ?event <http://predicates/timestamp> ?timestamp .
  hint:Prior hint:rangeSafe true .
  filter (?timestamp >= '2021-11-07T22:08:24.022+04:00'^^xsd:dateTime && ?timestamp < '2021-11-07T22:10:24.022+04:00'^^xsd:dateTime)
  ?event ?a ?v .
}

What prevented the optimization from kicking in was:

  1. For the case of integers, I did not specify the literal type explicitly, so I had ?timestamp >= 1606528746 instead of ?timestamp >= "1606528746"^^xsd:int. Strangely enough, the untyped form breaks the optimization.
  2. The hint must be placed right after the triple pattern and NOT after the filter statement.

It also turned out not to matter whether the hint value is written as true or "true": both forms work.
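To avoid repeating either mistake when building queries programmatically, the two rules can be captured in a small helper. This is only a sketch: the function names are hypothetical, and it simply reproduces the shape of the working queries shown in this answer.

```python
def typed_literal(value, datatype):
    """Render a SPARQL literal with an explicit datatype, e.g. "42"^^xsd:int."""
    return f'"{value}"^^{datatype}'


def range_query(lower, upper, datatype="xsd:int"):
    """Build the range query with the hint directly after the triple pattern."""
    low = typed_literal(lower, datatype)
    high = typed_literal(upper, datatype)
    return "\n".join([
        "select ?event ?a ?v",
        "where {",
        "  ?event <http://predicates/timestamp> ?timestamp .",
        # Rule 2: the hint follows the triple pattern, not the filter.
        '  hint:Prior hint:rangeSafe "true" .',
        # Rule 1: both bounds carry an explicit datatype.
        f"  filter (?timestamp >= {low} && ?timestamp < {high})",
        "  ?event ?a ?v .",
        "}",
    ])


print(range_query(1606528746, 1606528806))
```

For the date-time case, the same helper can be called with `datatype="xsd:dateTime"` and ISO 8601 strings as bounds.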

Many thanks to @StanislavKralin for providing a working example, which I used to rewrite my queries into a working form.



Source: https://stackoverflow.com/questions/65000453/does-blazegraph-support-range-query-optimization
