To use iSPARQL to compare values using similarity measures

≯℡__Kan透↙ 提交于 2019-11-29 08:04:56

Short Version — You can do some of this in pure SPARQL

You can use a query like this to find cities whose names are like "Londn", and order them by (one measure of) similarity. The rest of the answer explains how this works:

select ?city ?percent where {
  ?city a dbpedia-owl:City ;
        rdfs:label ?label .
  filter langMatches( lang(?label), 'en' )

  bind( replace( concat( 'x', str(?label) ), "^x[^Londn]*([L]?)[^ondn]*([o]?)[^ndn]*([n]?)[^dn]*([d]?)[^n]*([n]?).*$", '$1$2$3$4$5' ) as ?match )
  bind( xsd:float(strlen(?match))/strlen(str(?label)) as ?percent )
}
order by desc(?percent)
limit 100

SPARQL results

city                                  percent
----------------------------------------------
http://dbpedia.org/resource/London    0.833333
http://dbpedia.org/resource/Bonn      0.75
http://dbpedia.org/resource/Loudi     0.6
http://dbpedia.org/resource/Ladnu     0.6
http://dbpedia.org/resource/Lonar     0.6
http://dbpedia.org/resource/Longnan   0.571429
http://dbpedia.org/resource/Longyan   0.571429
http://dbpedia.org/resource/Luoding   0.571429
http://dbpedia.org/resource/Lodhran   0.571429
http://dbpedia.org/resource/Lom%C3%A9 0.5
http://dbpedia.org/resource/Andong    0.5

Computing string similarity metrics

Note: the code in this part of this answer works in Apache Jena. There's actually an edge case that causes this to (correctly) fail in Virtuoso. The update at the end addresses this issue.

There's nothing built into SPARQL for computing string matching distances, but you can do some of it using the regular expression replacement mechanism in SPARQL. Suppose you want to match the sequence "cat" in some strings. Then you can use a query like this to figure out how much of a given string is present in the sequence "cat":

select ?string ?match where {
  values ?string { "cart" "concatenate" "hat" "pot" "hop" }
  bind( replace( ?string, "^[^cat]*([c]?)[^at]*([a]?)[^t]*([t]?).*$", "$1$2$3" ) as ?match )
}
-------------------------
| string        | match |
=========================
| "cart"        | "cat" |
| "concatenate" | "cat" |
| "hat"         | "at"  |
| "pot"         | "t"   |
| "hop"         | ""    |
-------------------------

By examining the length of string and match, you should be able to compute some various similarity metrics. As a more complicated example that uses the "Londn" input you mentioned. The percent column is the percent of the string that matched the input.

select ?input
       ?string
       (strlen(?match)/strlen(?string) as ?percent)
where {
  values ?string { "London" "Londn" "London Fog" "Lando" "Land Ho!"
                   "concatenate" "catnap" "hat" "cat" "chat" "chart" "port" "part" }

  values (?input ?pattern ?replacement) {
    ("cat"   "^[^cat]*([c]?)[^at]*([a]?)[^t]*([t]?).*$"                              "$1$2$3")
    ("Londn" "^[^Londn]*([L]?)[^ondn]*([o]?)[^ndn]*([n]?)[^dn]*([d]?)[^n]*([n]?).*$" "$1$2$3$4$5")
  }

  bind( replace( ?string, ?pattern, ?replacement) as ?match )
}
order by ?pattern desc(?percent)
--------------------------------------------------------
| input   | string        | percent                    |
========================================================
| "Londn" | "Londn"       | 1.0                        |
| "Londn" | "London"      | 0.833333333333333333333333 |
| "Londn" | "Lando"       | 0.6                        |
| "Londn" | "London Fog"  | 0.5                        |
| "Londn" | "Land Ho!"    | 0.375                      |
| "Londn" | "concatenate" | 0.272727272727272727272727 |
| "Londn" | "port"        | 0.25                       |
| "Londn" | "catnap"      | 0.166666666666666666666666 |
| "Londn" | "cat"         | 0.0                        |
| "Londn" | "chart"       | 0.0                        |
| "Londn" | "chat"        | 0.0                        |
| "Londn" | "hat"         | 0.0                        |
| "Londn" | "part"        | 0.0                        |
| "cat"   | "cat"         | 1.0                        |
| "cat"   | "chat"        | 0.75                       |
| "cat"   | "hat"         | 0.666666666666666666666666 |
| "cat"   | "chart"       | 0.6                        |
| "cat"   | "part"        | 0.5                        |
| "cat"   | "catnap"      | 0.5                        |
| "cat"   | "concatenate" | 0.272727272727272727272727 |
| "cat"   | "port"        | 0.25                       |
| "cat"   | "Lando"       | 0.2                        |
| "cat"   | "Land Ho!"    | 0.125                      |
| "cat"   | "Londn"       | 0.0                        |
| "cat"   | "London"      | 0.0                        |
| "cat"   | "London Fog"  | 0.0                        |
--------------------------------------------------------

Update

The code above works in Apache Jena, but fails in Virtuoso, due to the fact that the pattern can match the empty string. E.g., if you try the following query on DBpedia's endpoint (which is powered by Virtuoso), you'll get the following error:

select (replace( "foo", ".*", "x" ) as ?bar) where {}

Virtuoso 22023 Error The regex-based XPATH/XQuery/SPARQL replace() function can not search for a pattern that can be found even in an empty string

This surprised me, but the spec for replace says that it's based on XPath fn:replace. The docs for fn:replace say:

An error is raised [err:FORX0003] if the pattern matches a zero-length string, that is, if the expression fn:matches("", $pattern, $flags) returns true. It is not an error, however, if a captured substring is zero-length.

We can get around this problem, though, by adding a character to the beginning of both the pattern and the string:

select ?input
       ?string
       (strlen(?match)/strlen(?string) as ?percent)
where {
  values ?string { "London" "Londn" "London Fog" "Lando" "Land Ho!"
                   "concatenate" "catnap" "hat" "cat" "chat" "chart" "port" "part" }

  values (?input ?pattern ?replacement) {
    ("cat"   "^x[^cat]*([c]?)[^at]*([a]?)[^t]*([t]?).*$"                              "$1$2$3")
    ("Londn" "^x[^Londn]*([L]?)[^ondn]*([o]?)[^ndn]*([n]?)[^dn]*([d]?)[^n]*([n]?).*$" "$1$2$3$4$5")
  }

  bind( replace( concat('x',?string), ?pattern, ?replacement) as ?match )
}
order by ?pattern desc(?percent)
--------------------------------------------------------
| input   | string        | percent                    |
========================================================
| "Londn" | "Londn"       | 1.0                        |
| "Londn" | "London"      | 0.833333333333333333333333 |
| "Londn" | "Lando"       | 0.6                        |
| "Londn" | "London Fog"  | 0.5                        |
| "Londn" | "Land Ho!"    | 0.375                      |
| "Londn" | "concatenate" | 0.272727272727272727272727 |
| "Londn" | "port"        | 0.25                       |
| "Londn" | "catnap"      | 0.166666666666666666666666 |
| "Londn" | "cat"         | 0.0                        |
| "Londn" | "chart"       | 0.0                        |
| "Londn" | "chat"        | 0.0                        |
| "Londn" | "hat"         | 0.0                        |
| "Londn" | "part"        | 0.0                        |
| "cat"   | "cat"         | 1.0                        |
| "cat"   | "chat"        | 0.75                       |
| "cat"   | "hat"         | 0.666666666666666666666666 |
| "cat"   | "chart"       | 0.6                        |
| "cat"   | "part"        | 0.5                        |
| "cat"   | "catnap"      | 0.5                        |
| "cat"   | "concatenate" | 0.272727272727272727272727 |
| "cat"   | "port"        | 0.25                       |
| "cat"   | "Lando"       | 0.2                        |
| "cat"   | "Land Ho!"    | 0.125                      |
| "cat"   | "Londn"       | 0.0                        |
| "cat"   | "London"      | 0.0                        |
| "cat"   | "London Fog"  | 0.0                        |
--------------------------------------------------------
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!