How do you set the Collation for a SPARQL query?

Question

I am a Java developer working with a MarkLogic database. A key function of my code is its capacity to dynamically generate 4-6 SPARQL queries and run them via HTTP GET requests. The results of each are added together and then returned. I now need these results sorted consistently.

Since I am paging the results of each query (using the LIMIT and OFFSET clauses), each query has its own ORDER BY clause. Without sorting embedded in the queries, the pages of results would be returned out of order.

However, each query returns its own individually sorted results, which need to be merged into a single sorted list. My preference would be an alphanumeric sort that considers characters before considering case and that sorts empty and null values to the end. (Example: “0123456789AaBbCc…WwXxYyZz ”)

I have already done this in my Java code using a custom compare method, but I recently ran into a problem: my results still aren’t returning sorted. The issue I’m having stems from the fact that my custom ordering scheme is completely separate from the one used by SPARQL, resulting in a decidedly unsorted set of results. While I have considered sorting the results from scratch before returning them instead of assuming MarkLogic is returning sorted results, this seems unnecessarily wasteful and it may not even fix my problem.
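For reference, a minimal sketch of what such a compare method might look like. This is one reading of the ordering described above, not the actual code from my project: the class name `CharThenCaseOrder` is hypothetical, and it assumes "characters before case" means comparing whole strings case-insensitively first and only falling back to case as a tie-breaker.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper; the name and the exact tie-breaking rules are assumptions.
final class CharThenCaseOrder {

    /** Null and empty strings sort last; otherwise letters are compared first, case second. */
    static final Comparator<String> COMPARATOR = (a, b) -> {
        boolean aBlank = a == null || a.isEmpty();
        boolean bBlank = b == null || b.isEmpty();
        if (aBlank || bBlank) {
            // false (has text) sorts before true (null or empty)
            return Boolean.compare(aBlank, bBlank);
        }
        // First pass ignores case, so "a" and "A" both sort before "B".
        int ignoringCase = String.CASE_INSENSITIVE_ORDER.compare(a, b);
        if (ignoringCase != 0) {
            return ignoringCase;
        }
        // Tie-breaker: plain comparison puts uppercase ASCII before lowercase ("A" before "a").
        return a.compareTo(b);
    };

    /** Returns a sorted copy and leaves the input untouched. */
    static List<String> sorted(List<String> values) {
        List<String> copy = new ArrayList<>(values);
        copy.sort(COMPARATOR);
        return copy;
    }

    public static void main(String[] args) {
        System.out.println(sorted(Arrays.asList("b", "", "a", "B", null, "A", "10", "2")));
    }
}
```

Note that this compares digit by digit, so "10" sorts before "2". Merging several lists with this comparator is only correct if every input list was already sorted by the same rule.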

In my research I have not been able to find any way to set the collation for SPARQL, nor have I found a way to write a custom collation. The documentation on this page (https://www.w3.org/TR/rdf-sparql-query/#modOrderBy) specifically states that SPARQL’s ORDER BY is based on a comparison method driven by XPath’s fn:compare. That function references this page (https://www.w3.org/TR/xpath-functions/#collations), which specifically mentions options for specifying the collation as well as using alternative implementations of the Unicode Collation Algorithm. What I can’t find is anything detailing how to actually do this.

In short, is there any way for me to manipulate or control how a SPARQL query compares characters to affect the final order?

Answer 1:

If I understand what you're asking, you want to use ORDER BY, OFFSET, and LIMIT to select which results you're going to show, and then you want another ORDER BY to determine the order in which you'll show those results (which might be different than the order that you used to select them). You can do that with a nested query:

select ?result {
  { select ?result where {
      #-- ...
    }
    order by #-- ...
    offset #-- ...
    limit #-- ...
  }
}
order by #-- ...

There's not a whole lot of support for custom orderings, but you can use functions in the order expressions, and you can provide multiple expressions to sort first by one thing, then by another. In your case, it looks like you might want to do something like order by lcase(?value) to order case-insensitively. (That won't be perfect, of course. For instance, it's not clear to me whether you want a numeric sort for numeric prefixes or not, e.g., should the order be 1, 10, 2 or 1, 2, 10?)
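Since the queries in the question are generated from Java, here is a rough sketch of how the nested form could be assembled there. Everything specific in it is an assumption for illustration: the class and method names, the `?s` and `?value` variables, and the single triple pattern. The outer ORDER BY sorts by the lowercased string first and by the raw value second.

```java
// Hypothetical query builder; names, variables and the triple pattern are illustrative only.
final class PagedQueryBuilder {

    /**
     * Builds a nested SPARQL query: the inner SELECT picks one page,
     * the outer ORDER BY decides how that page is presented.
     * The predicate IRI is inserted verbatim, so it must come from a trusted source.
     */
    static String build(String predicateIri, int offset, int limit) {
        if (offset < 0 || limit < 1) {
            throw new IllegalArgumentException("offset must be >= 0 and limit must be >= 1");
        }
        return String.join("\n",
            "SELECT ?value WHERE {",
            "  { SELECT ?value WHERE {",
            "      ?s <" + predicateIri + "> ?value",
            "    }",
            "    ORDER BY ?value",
            "    OFFSET " + offset,
            "    LIMIT " + limit,
            "  }",
            "}",
            // STR() turns the value into a plain string so LCASE() also accepts non-string literals.
            "ORDER BY LCASE(STR(?value)) ?value");
    }

    public static void main(String[] args) {
        System.out.println(build("http://example.org/label", 20, 10));
    }
}
```

The inner ORDER BY still uses whatever ordering the engine applies, so this only changes the order within a page, not which rows land on which page.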

Answer 2:

I just got a definitive answer from SPARQL implementers.

The SPARQL spec doesn't really address collations. MarkLogic uses Unicode codepoint collation for SPARQL ordering.
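To see why a codepoint ordering clashes with an "AaBbCc" style comparator, here is a small Java illustration. It uses Java's natural String ordering as a stand-in: that compares UTF-16 code units, which matches codepoint order for characters in the Basic Multilingual Plane but can differ for supplementary characters. The class name is hypothetical and this is not MarkLogic code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical demo class; Java's natural String order stands in for codepoint collation.
final class CodepointOrderDemo {

    /** Returns a copy sorted by UTF-16 code unit value (codepoint order for BMP text). */
    static List<String> sortByCodeUnits(List<String> values) {
        List<String> copy = new ArrayList<>(values);
        copy.sort(Comparator.naturalOrder());
        return copy;
    }

    public static void main(String[] args) {
        // Digits (48-57) come first, then all uppercase (65-90), then all lowercase (97-122).
        System.out.println(sortByCodeUnits(Arrays.asList("b", "A", "a", "B", "1")));
        // prints [1, A, B, a, b]
    }
}
```

A list in this order looks unsorted to a comparator that expects "A, a, B, b", which is why merging pre-sorted pages with a different rule produces jumbled output.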

HOWEVER, we need to know your requirements. MarkLogic as you know supports all kinds of collations, and that support is built into the code backing SPARQL -- we simply have not exposed an interface as to how to leverage collations from SPARQL.

MarkLogic is watching this thread, so feel free to make that request, perhaps with a suggestion of how you would consider accessing collations from the query, and we'll see it.

Answer 3:

I contacted Kevin Morgan from MarkLogic about this, and he was extremely helpful. We had a WebEx meeting yesterday discussing various solutions to the problem and it went very well.

Their engineers confirmed that so far there is no means of forcing SPARQL to use a particular sorting order. They proposed two promising solutions to my problem:

• Embed your triples within your documents and leverage document searches and range indexes: While this works for multiple system designs, it does not work for ours. Sorting and Pagination fall under a product upgrade and we cannot require our clients to completely re-ingest their data so we can apply this new standard.

• Wrap your SPARQL queries within an XQuery statement: This approach uses SPARQL to determine the entire result set, and then utilizes a custom collation within the XQuery to handle sorting. Pagination is also handled in the XQuery (for the obvious reason that paginating before sorting breaks both).
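The order of operations in the second option (fetch everything, sort with the chosen ordering, then cut out a page) can be sketched as follows. This is written in Java purely for illustration; the real solution would run as XQuery inside MarkLogic, and the class name, method name and the `Comparator` standing in for a collation are all assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper showing "sort first, paginate second"; not MarkLogic or XQuery code.
final class SortThenPage {

    /** Sorts a copy of the full result set, then returns the requested slice of it. */
    static <T> List<T> page(List<T> all, Comparator<? super T> order, int offset, int limit) {
        if (offset < 0 || limit < 0) {
            throw new IllegalArgumentException("offset and limit must be non-negative");
        }
        List<T> sorted = new ArrayList<>(all);
        sorted.sort(order); // the comparator plays the role of the custom collation

        // Clamp both ends so a page past the end is simply empty.
        int from = Math.min(offset, sorted.size());
        int to = (int) Math.min((long) from + limit, sorted.size());
        return new ArrayList<>(sorted.subList(from, to));
    }

    public static void main(String[] args) {
        List<String> all = Arrays.asList("b", "A", "c", "a", "B");
        System.out.println(page(all, String.CASE_INSENSITIVE_ORDER, 1, 3));
    }
}
```

The cost is visible in the sketch: every page request sorts the full result set, which is the performance concern worth measuring before adopting this approach.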

The second solution seems like it will work for us, but I will need to look into the performance costs before we can seriously consider implementing it. Incidentally, I find it very odd that SPARQL’s sorting does not support collations when the XQuery functions it is built upon do. It seems illogical to assume that its users will never want to sort untagged literal values with anything other than the basic Unicode Codepoint sorting. At what point does it become reasonable for me to take something built upon XQuery and embed it within XQuery because it seems the creators “left something out?”

Source: https://stackoverflow.com/questions/38961492/how-do-you-set-the-collation-for-a-sparql-query

Tags

sorting

sparql

marklogic

collation