Wikidata results sorted by something similar to a PageRank

Submitted by 半世苍凉 on 2019-12-17 16:36:06

Question


In Wikidata (Wikidata SPARQL endpoint), is there a way to order the SPARQL query results with something like a PageRank?

SELECT DISTINCT ?entity ?entityLabel WHERE {
    ?entity wdt:P31 wd:Q5.
    SERVICE wikibase:label {
     bd:serviceParam wikibase:language "en" .
    }
} LIMIT 100 OFFSET 0

Can we specify a field to order the results by, such that the entity at the top is more notable/important/recognizable than the one after it, and so on?


Answer 1:


It seems that PageRank does not make much sense for Wikidata. Obviously, large classes and large aggregates will rank highest.

Also, unlike web links, RDF predicates are "navigable" from both sides; this is just a matter of design, which URI is a subject and which URI is an object.
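To illustrate the two points above, here is a minimal sketch of a power-iteration PageRank on a toy graph. The node names, the damping factor, and the iteration count are assumptions made up for the illustration, not Wikidata data. It shows a class node collecting the top score, and the ranking flipping when the same facts are modeled with the inverse predicate.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a list of (source, target) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    count = len(nodes)
    rank = {node: 1.0 / count for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread evenly over all nodes.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        base = (1.0 - damping) / count + damping * dangling / count
        new_rank = {node: base for node in nodes}
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank


# Toy triples (hypothetical): three people are each an "instance of" one class.
forward = [("alice", "human"), ("bob", "human"), ("carol", "human")]
# The same facts modeled with the inverse predicate ("has instance").
inverse = [(dst, src) for src, dst in forward]

forward_rank = pagerank(forward)
inverse_rank = pagerank(inverse)

print(max(forward_rank, key=forward_rank.get))  # the class node leads
print(min(inverse_rank, key=inverse_rank.get))  # the class node now trails
```

With the forward edges the class node collects rank from every instance; with the inverse edges it hands rank out instead, so the result depends on a modeling choice rather than on notability.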

However, Andreas Thalhammer continues his work on PageRank for Wikidata. According to his scores, the top 10 Wikidata entities are:

ID      Label       PageRank
Q729    animal      24996.77
Q30     USA         24772.45
Q1360   Arthropoda  16930.883
Q1390   insects     16531.822
Q35409  family      14403.091
Q756    plant       14019.927
Q142    France      13723.484
Q34740  genus       13718.484
Q16     Canada      12321.178
Q159    Russia      11707.16

Unfortunately, these PageRank scores are not published on the (same) endpoint, so one can't query them using SPARQL.


Fortunately, one can figure out some kind of a rank oneself. Possible options are:

  1. Number of outgoing statements (precalculated);
  2. Number of sitelinks (precalculated);
  3. Number of incoming statements (in the example below, only truthy statements are counted).

Example query:

SELECT ?item ?itemLabel ?outcoming ?sitelinks ?incoming {
    # EU members: items that are a member of (P463) the European Union (Q458)
    ?item wdt:P463 wd:Q458 .
    # Precalculated counts stored on each item
    ?item wikibase:statements ?outcoming .
    ?item wikibase:sitelinks ?sitelinks .
    {
        # Count the truthy statements that point at each item
        SELECT (COUNT(?s) AS ?incoming) ?item WHERE {
            ?item wdt:P463 wd:Q458 .
            ?s ?p ?item .
            [] wikibase:directClaim ?p
        } GROUP BY ?item
    }
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
} ORDER BY DESC(?incoming)

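For use from a script, the following is a rough sketch of how results of such a query could be fetched and re-sorted by any of the three columns. The endpoint URL, the User-Agent string, and the helper names `run_query` and `rank_rows` are assumptions for illustration, and the sample response contains made-up counts that only mimic the standard SPARQL JSON results layout.

```python
import json
import urllib.parse
import urllib.request

# Assumed public endpoint URL; adjust if your setup differs.
ENDPOINT = "https://query.wikidata.org/sparql"


def run_query(sparql, endpoint=ENDPOINT):
    """Send a SPARQL query and return the parsed JSON response (needs network access)."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": sparql, "format": "json"})
    request = urllib.request.Request(url, headers={"User-Agent": "rank-sketch/0.1 (example)"})
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def rank_rows(response, metric):
    """Flatten the result bindings and sort them by one numeric column, highest first."""
    rows = []
    for binding in response["results"]["bindings"]:
        rows.append({
            "item": binding["item"]["value"],
            "label": binding.get("itemLabel", {}).get("value", ""),
            metric: int(binding[metric]["value"]),
        })
    return sorted(rows, key=lambda row: row[metric], reverse=True)


# Hypothetical response with made-up counts, used here instead of a live request.
sample = {"results": {"bindings": [
    {"item": {"value": "http://www.wikidata.org/entity/Q142"},
     "itemLabel": {"value": "France"}, "sitelinks": {"value": "10"}},
    {"item": {"value": "http://www.wikidata.org/entity/Q183"},
     "itemLabel": {"value": "Germany"}, "sitelinks": {"value": "20"}},
]}}

ranked = rank_rows(sample, "sitelinks")
for row in ranked:
    print(row["label"], row["sitelinks"])
```

Passing the query text to `run_query` and its result to `rank_rows` with "outcoming", "sitelinks", or "incoming" would let the ordering be switched without editing the `ORDER BY` clause.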

As of October 2017, all these metrics are more or less correlated.

Below are the correlation coefficients of these measures for the EU member states.

Pearson
-------
          outcoming sitelinks incoming pagerank    
outcoming    1.0000    0.6907   0.7416   0.8652
sitelinks    0.6907    1.0000   0.4314   0.5717
incoming     0.7416    0.4314   1.0000   0.8978
pagerank     0.8652    0.5717   0.8978   1.0000


Spearman
--------
          outcoming sitelinks incoming pagerank
outcoming    1.0000    0.6869   0.7619   0.8736
sitelinks    0.6869    1.0000   0.7680   0.8342
incoming     0.7619    0.7680   1.0000   0.8872
pagerank     0.8736    0.8342   0.8872   1.0000


Kendall
-------  
          outcoming sitelinks incoming pagerank
outcoming    1.0000    0.4914   0.5661   0.7143
sitelinks    0.4914    1.0000   0.5764   0.6454
incoming     0.5661    0.5764   1.0000   0.7249
pagerank     0.7143    0.6454   0.7249   1.0000
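For reference, the three coefficients can be computed along the following lines. This is a minimal sketch on made-up numbers, not the data behind the tables above. It assumes there are no tied values and uses the tau-a variant of Kendall's coefficient, since the answer does not say which variant was used.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def ranks(values):
    """Rank positions (1 = smallest); assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank positions."""
    return pearson(ranks(xs), ranks(ys))


def kendall(xs, ys):
    """Kendall tau-a: (concordant - discordant) pairs over all pairs; assumes no ties."""
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (len(xs) * (len(xs) - 1) / 2)


# Made-up counts for five hypothetical items: same ordering, one large outlier.
sitelinks = [10, 20, 30, 40, 50]
incoming = [1, 4, 9, 16, 100]

print(pearson(sitelinks, incoming))
print(spearman(sitelinks, incoming))
print(kendall(sitelinks, incoming))
```

In this toy data the two columns order the items identically, so the rank-based coefficients are at their maximum while Pearson is pulled down by the outlier. A gap of this kind between Pearson and the rank-based measures is one thing to look for when reading the tables.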

See also:

  • https://phabricator.wikimedia.org/T143424
  • https://wiki.blazegraph.com/wiki/index.php/RDF_GAS_API#PageRank
  • https://phabricator.wikimedia.org/T162279


Source: https://stackoverflow.com/questions/39438022/wikidata-results-sorted-by-something-similar-to-a-pagerank
