SPARQL to get all parents of all nodes

Question

I have been using this post to get the parents or lineage of a single RDF node: SPARQL query to get all parent of a node

This works nicely on my Virtuoso server. Sorry, I couldn't find a public endpoint containing data with a similar structure.

prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix bto: <http://purl.obolibrary.org/obo/>
select (group_concat(distinct ?midlab ; separator = "|") AS ?lineage)
where
{ 
  bto:BTO_0000207 rdfs:subClassOf* ?mid .
  ?mid rdfs:subClassOf* ?class .
  ?mid rdfs:label ?midlab .
}
group by ?lineage
order by (count(?mid) as ?ordercount)

giving

+---------------------------------------------------------+
|                         lineage                         |
+---------------------------------------------------------+
| bone|cartilage|connective tissue|tibia|tibial cartilage |
+---------------------------------------------------------+

Then I wondered if I could get the lineage for all nodes by changing the select to

select ?s (group_concat(distinct ?midlab ; separator = "|") AS ?lineage)

and the first line in the where clause to

?s rdfs:subClassOf* ?mid .

Those who have more SPARQL experience than I will probably not be surprised that the query timed out.

Is this a reasonable approach? Am I doing something wrong syntactically?

I suspect that the distinct keyword or the group by clause is the bottleneck, because this only takes a second or two:

prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix bto: <http://purl.obolibrary.org/obo/>
select ?s ?midlab
where
{ 
  ?s rdfs:subClassOf* ?mid .
  ?mid rdfs:subClassOf* ?class .
  ?mid rdfs:label ?midlab .
  ?s <http://www.geneontology.org/formats/oboInOwl#hasOBONamespace> "BrendaTissueOBO"^^<http://www.w3.org/2001/XMLSchema#string> .
}

Answer 1:

Your first query isn't legal; you can check it with sparql.org's query validator. While you can order by count(?mid), you can't bind the value to a variable and order by it in the same clause. Dropping the binding gives you:

prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix bto: <http://purl.obolibrary.org/obo/>
select (group_concat(distinct ?midlab ; separator = "|") AS ?lineage)
where
{ 
  bto:BTO_0000207 rdfs:subClassOf* ?mid .
  ?mid rdfs:subClassOf* ?class .
  ?mid rdfs:label ?midlab .
}
group by ?lineage
order by count(?mid)

Now, that's legal, but it doesn't make much sense. group_concat is an aggregate: it concatenates the values within each group. In the absence of a group by clause you get a single implicit group, so using group_concat without group by is fine. Here, though, group by ?lineage doesn't make sense, because ?lineage is itself the aggregate result and so already has exactly one value per group. It would be better to group by ?s, as in the following query, which seems more correct and might not time out:

prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
select ?s (group_concat(distinct ?midlab ; separator = "|") AS ?lineage)
where
{ 
  ?s rdfs:subClassOf* ?mid .
  ?mid rdfs:subClassOf* ?class .
  ?mid rdfs:label ?midlab .
}
group by ?s
order by count(?mid)
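
As a further sketch (not part of the original answer, and not tested against the asker's Virtuoso data): in the queries above, ?class is never selected or aggregated, and because rdfs:subClassOf* also matches a path of length zero, the pattern ?mid rdfs:subClassOf* ?class never filters anything out. It only multiplies each row by the number of ancestors of ?mid. Dropping it, and keeping the hasOBONamespace restriction from the question's fast query so that ?s does not range over every node in the graph, may leave the engine with fewer rows to process before distinct and group by are applied:

```sparql
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# Assumption: the lineage is only wanted for terms in the BrendaTissueOBO
# namespace, as in the question's fast query.
select ?s (group_concat(distinct ?midlab ; separator = "|") as ?lineage)
where
{
  # Restrict ?s so the zero-or-more path does not start from every node.
  ?s <http://www.geneontology.org/formats/oboInOwl#hasOBONamespace> "BrendaTissueOBO"^^<http://www.w3.org/2001/XMLSchema#string> .
  # ?mid is ?s itself (zero-length path) or any of its ancestors.
  ?s rdfs:subClassOf* ?mid .
  ?mid rdfs:label ?midlab .
}
group by ?s
```

Each lineage contains the same labels as with the extra pattern, since group_concat(distinct ...) ignores the duplicated rows. Only count(?mid) would change, so the order by is left out here. Whether this actually avoids the timeout depends on the data and on how the endpoint evaluates the property path.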

Source: https://stackoverflow.com/questions/31496982/sparql-to-get-all-parents-of-all-nodes

Tags

rdf

sparql

virtuoso