How can I get the Fuseki API via SPARQLWrapper to properly report a detailed error message?

Posted by 南楼画角 on 2021-01-29 05:40:49


As outlined in


I tried to allow for a "round trip" operation between Python lists of dicts and Jena/SPARQL-based storage.

The approach performs very well for my use case, and after trying it out for a while I am getting into more details that need to be addressed.

The Stack Overflow question "listOfDict to RDF conversion in python targeting Apache Jena Fuseki" addresses the initial issues, and issues 2-5 show some detailed problems that have already been fixed.

Now I am working with some 180000 records I'd like to import from 6 different data sources, and each data source seems to have new exotic records that make the approach fail.

E.g. one batch of records gives me the following log:

read 45601 events in   0.6 s
storing 45601 events to sparql
  batch for         1 -      2000 of     45601 cr:Event in    0.6 s ->    0.6 s
  batch for      2001 -      4000 of     45601 cr:Event in    0.5 s ->    1.1 s
  batch for      4001 -      6000 of     45601 cr:Event in    0.5 s ->    1.6 s
  batch for      6001 -      8000 of     45601 cr:Event in    0.5 s ->    2.1 s
  batch for      8001 -     10000 of     45601 cr:Event in    0.5 s ->    2.6 s
  batch for     10001 -     12000 of     45601 cr:Event in    0.7 s ->    3.2 s
ERROR: testCrossref (tests.test_Crossref.TestCrossref)
test loading crossref data
Traceback (most recent call last):
  File "/Users/wf/Library/Python/3.8/lib/python/site-packages/SPARQLWrapper/", line 1073, in _query
    response = urlopener(request)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/", line 222, in urlopen
    return, data, timeout)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/", line 531, in open
    response = meth(req, response)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/", line 640, in http_response
    response = self.parent.error(
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/", line 569, in error
    return self._call_chain(*args)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/", line 502, in _call_chain
    result = func(*args)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 400: Bad Request

SPARQLWrapper.SPARQLExceptions.QueryBadFormed: QueryBadFormed: a bad request has been sent to the endpoint, probably the sparql query is bad formed.

b'Error 400: Bad Request\n'

Since I don't get any details on what the problem is, I am resorting to a binary search. From the error above I only know that the problem is with a record with a batchIndex between 12000 and 14000, so I am setting the limit to 14000 and the batchSize to 100 to get closer.

 batch for     13301 -     13400 of     14000 cr:Event in    0.0 s ->    4.3 s

is now the last successful batch. So I am using a binary search: 13450 fail, 13425 fail, 13412 ok, 13418 ok, 13422 fail, 13420 ok, 13421 ok. So record 13422 is the culprit, and I switch on debug mode to see the INSERT DATA created for the record:

  cr:Event__102140gtm20003 cr:Event_name "Higher local fields".
  cr:Event__102140gtm20003 cr:Event_location "M\\"unster, Germany".
  cr:Event__102140gtm20003 cr:Event_source "crossref".
  cr:Event__102140gtm20003 cr:Event_eventId "10.2140/gtm.2000.3".
  cr:Event__102140gtm20003 cr:Event_title "Invitation to higher local fields".
  cr:Event__102140gtm20003 cr:Event_startDate "1999-08-29"^^<>.
  cr:Event__102140gtm20003 cr:Event_year 1999.
  cr:Event__102140gtm20003 cr:Event_month 9.
  cr:Event__102140gtm20003 cr:Event_endDate "1999-09-05"^^<>.

So the umlaut encoding of "Münster" in the location (written as M\\"unster in the generated data, which leaves the inner double quote unescaped) is the culprit here. I will work around this issue. The real question is:
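One possible workaround, as a minimal sketch rather than the code actually used here: in the generated literal the backslash is doubled but the double quote is left alone, so the SPARQL string already ends after `M\\`. Escaping each character of the value exactly once avoids that. The helper name `escape_sparql_literal` is hypothetical, and the sketch assumes the values are written as double-quoted SPARQL string literals.

```python
# Characters that must be escaped inside a double-quoted SPARQL string literal.
_SPARQL_ESCAPES = {
    "\\": "\\\\",  # backslash -> \\
    '"': '\\"',    # double quote -> \"
    "\n": "\\n",   # newline -> \n
    "\r": "\\r",   # carriage return -> \r
    "\t": "\\t",   # tab -> \t
}


def escape_sparql_literal(value: str) -> str:
    """Escape value for use between double quotes in a SPARQL literal.

    Works character by character, so an already emitted escape
    is never escaped a second time.
    """
    return "".join(_SPARQL_ESCAPES.get(ch, ch) for ch in value)


# The raw location value from the failing record: M\"unster, Germany
location = 'M\\"unster, Germany'
triple = (
    'cr:Event__102140gtm20003 cr:Event_location '
    f'"{escape_sparql_literal(location)}".'
)
print(triple)
```

With this, the backslash and the quote both arrive in the store as part of the value instead of terminating the literal early.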

How can I get the Fuseki API via SPARQLWrapper to properly report a detailed error message?

e.g. with something like

error in line # cr:Event__102140gtm20003 cr:Event_location "M\\"unster, Germany". is not a valid triple?
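Until the endpoint reports the offending line itself, the manual binary search described above could be automated on the client side. The following is only a sketch under assumptions: `find_bad_record` and the `insert_ok` callback are hypothetical names, `insert_ok(batch)` is assumed to return True when the INSERT for that batch succeeds and False on an error such as QueryBadFormed, and a batch is assumed to fail exactly when it contains a bad record.

```python
from typing import Callable, Optional, Sequence


def find_bad_record(
    records: Sequence,
    insert_ok: Callable[[Sequence], bool],
) -> Optional[int]:
    """Return the index of one record that makes the insert fail.

    Returns None if the whole list can be inserted without error.
    """
    if insert_ok(records):
        return None
    # Invariant: records[lo:hi] contains at least one bad record.
    lo, hi = 0, len(records)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if insert_ok(records[lo:mid]):
            lo = mid  # left half is fine, so the culprit is in the right half
        else:
            hi = mid  # left half fails, so keep searching there
    return lo


# Stand-in for the real insert: record 13422 is treated as the bad one.
# Indices here are 0-based, so batch index 13422 is list position 13421.
records = list(range(1, 14001))
bad_index = find_bad_record(records, lambda batch: 13422 not in batch)
print(f"error in record {records[bad_index]}")
```

This needs about log2(n) extra insert attempts per failing batch and finds one culprit at a time; after fixing or skipping that record the search can be repeated for the rest.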

