How to read Freebase RDF data? It seems to be a bit broken

Submitted by 心已入冬 on 2019-12-11 10:31:36

Question


I'm following the instructions for parsing Freebase RDF with Ruby at https://developers.google.com/freebase/v1/rdf-overview

Environment: rdf-1.1.6, rdf-turtle-1.1.4, ruby-2.1.4[ x86_64 ], Ubuntu 14.10

My code is:

require 'rubygems'
require 'cgi'
require 'addressable/uri'
require 'rdf'
require 'rdf/turtle'

topic_id = '/m/0d6lp'
url = Addressable::URI.parse('https://www.googleapis.com/freebase/v1/rdf' + topic_id)

RDF::Turtle::Reader.open(url) do |reader|
  reader.each_statement do |statement|
    puts statement.inspect
  end
end

I get errors:

ERROR [line: 131] With input '"Cidade e Condado de S\xe3o Francisco"@pt;
    ns:common.topic.alias    "City and County of San Franc': Invalid token "\"Cidade" (found "\"Cidade"), production = :_predicateObjectList_5
ERROR [line: 131] Expected one of ["a", :IRIREF, :PNAME_LN, :PNAME_NS] (found "\"City and County of San Francisco\""(STRING_LITERAL_QUOTE)), production = :predicateObjectList
ERROR [line: 132] Expected one of ["a", :IRIREF, :PNAME_LN, :PNAME_NS] (found "\"SF\""(STRING_LITERAL_QUOTE)), production = :predicateObjectList
ERROR [line: 133] Expected one of ["a", :IRIREF, :PNAME_LN, :PNAME_NS] (found "\"Frisco\""(STRING_LITERAL_QUOTE)), production = :predicateObjectList
ERROR [line: 134] Expected one of ["a", :IRIREF, :PNAME_LN, :PNAME_NS] (found "\"The City by the Bay\""(STRING_LITERAL_QUOTE)), production = :predicateObjectList
ERROR [line: 135] Expected one of ["a", :IRIREF, :PNAME_LN, :PNAME_NS] (found "\"San Francisco, Kalifornija\""(STRING_LITERAL_QUOTE)), production = :predicateObjectList
ERROR [line: 136] Expected one of ["a", :IRIREF, :PNAME_LN, :PNAME_NS] (found "\"San Francisco\""(STRING_LITERAL_QUOTE)), production = :predicateObjectList
ERROR [line: 137] Expected one of ["a", :IRIREF, :PNAME_LN, :PNAME_NS] (found "\"San Francisco, Calif.\""(STRING_LITERAL_QUOTE)), production = :predicateObjectList
ERROR [line: 138] Expected one of ["a", :IRIREF, :PNAME_LN, :PNAME_NS] (found "\"City by the Bay - San Francisco\""(STRING_LITERAL_QUOTE)), production = :predicateObjectList
ERROR [line: 139] Expected one of ["a", :IRIREF, :PNAME_LN, :PNAME_NS] (found "\"La Ciutat i el Comtat de San Francisco\""(STRING_LITERAL_QUOTE)), production = :predicateObjectList
ERROR [line: 140] Expected one of ["a", :IRIREF, :PNAME_LN, :PNAME_NS] (found "\"Yerba Buena\""(STRING_LITERAL_QUOTE)), production = :predicateObjectList
ERROR [line: 141] Expected one of ["a", :IRIREF, :PNAME_LN, :PNAME_NS] (found "\"La ciutat i comtat de San Francisco\""(STRING_LITERAL_QUOTE)), production = :predicateObjectList
ERROR [line: 142] Expected one of ["a", :IRIREF, :PNAME_LN, :PNAME_NS] (found "\"San Franciskas\""(STRING_LITERAL_QUOTE)), production = :predicateObjectList
ERROR [line: 143] Expected one of ["a", :IRIREF, :PNAME_LN, :PNAME_NS] (found "\"San Francisco, Kalifornija\""(STRING_LITERAL_QUOTE)), production = :predicateObjectList
ERROR [line: 144] Expected one of ["a", :IRIREF, :PNAME_LN, :PNAME_NS] (found "\"旧金山\""(STRING_LITERAL_QUOTE)), production = :predicateObjectList
ERROR [line: 145] Expected one of [:IRIREF, :BLANK_NODE_LABEL, :ANON, "(", "[", :PNAME_LN, :PNAME_NS, :INTEGER, :DECIMAL, :DOUBLE, "true", "false", :STRING_LITERAL_QUOTE, :STRING_LITERAL_SINGLE_QUOTE, :STRING_LITERAL_LONG_SINGLE_QUOTE, :STRING_LITERAL_LONG_QUOTE] (found ";"), production = :objectList
ERROR [line: 146] Expected one of [:IRIREF, :BLANK_NODE_LABEL, :ANON, "(", "[", :PNAME_LN, :PNAME_NS, :INTEGER, :DECIMAL, :DOUBLE, "true", "false", :STRING_LITERAL_QUOTE, :STRING_LITERAL_SINGLE_QUOTE, :STRING_LITERAL_LONG_SINGLE_QUOTE, :STRING_LITERAL_LONG_QUOTE] (found ";"), production = :objectList
ERROR [line: 147] Expected one of ["a", :IRIREF, :PNAME_LN, :PNAME_NS] (found "\"Bandar raya dan Daerah San Francisco merupakan bandar raya keempat paling ramai penduduk di California dan keempat belas di Amerika Syarikat, dengan anggaran penduduk seramai 744,041 pada 2006. Ia terletak di hujung Semenanjung San Francisco dan merupakan titik fokus kewangan, kebudayaan serta pengangkutan kawasan metropolitan San Francisco Bay Area. San Francisco merupakan bandar utama kedua paling padat di Amerika Syarikat.\nPada 1776, orang-orang Sepanyol menduduki hujung semenanjung San Francisco, dan mendirikan sebuah kubu dan misi di Golden Gate. Kerubut Emas California pada 1848 mendorong bandar ini berkembang dengan pesat. Selepas dimusnahkan dalam Gempa bumi San Francisco 1906, San Francisco telah dibina semula dengan cepat.\nSemasa tahun 1960-an, kawasan Haight-Ashbury di San Francisco menjadi terkenal apabila menjadi pusat budaya hippie apabila ribuan golongan muda dan seniman bermigrasi ke lokasi tersebut. Walaupun Haight-Ashbury telah mengalami gentrifikasi dan hilang identiti budaya hippie dalam dekad-dekad berikutnya, San Francisco telah menjadi sinonim dengan budaya dan nostalgia hippie.\""(STRING_LITERAL_QUOTE)), production = :predicateObjectList
ERROR [line: 148] Expected one of ["a", :IRIREF, :PNAME_LN, :PNAME_NS] (found "\"ซานฟรานซิสโก หรือ แซนแฟรนซิสโก คือเมืองในรัฐแคลิฟอร์เนีย สหรัฐอเมริกา มีประชากร ประมาณ 808,976 คน ซึ่งเป็นเมืองที่มีความหนาแน่นประชากรเป็นอันดับสองของประเทศ เมืองซานฟรานซิสโกตั้งอยู่บริเวณอ่าวซานฟรานซิสโก\nชาวยุโรปกลุ่มแรกที่มาตั้งรกรากในซานฟรานซิสโกคือชาวสเปน โดยในปี ค.ศ. 1776 เมืองมีชื่อว่า เซนต์ฟรานซิส ในภายหลังจากช่วงยุคตื่นทองในปี ค.ศ. 1848 ทำให้ประชากรในซานฟรานซิสโกเพิ่มขึ้นอย่างรวดเร็ว และเมืองเติบโตอย่างมาก ถึงแม้ว่าซานฟรานซิสโกจะประสบปัญหา แผ่นดินไหวและไฟไหม้ขนาดใหญ่ในช่วงปี ค.ศ. 1906 ซานฟรานซิสโกกลับฟื้นตัวได้อย่างรวดเร็ว และได้ชื่อว่าเป็นเมืองสำคัญเมืองหนึ่งในแถบชายฝั่งตะวันตกของประเทศ\nซานฟรานซิสโกมีลักษณะภูมิประเทศที่เป็นเขา และมีชายฝั่งติดกับมหาสมุทรแปซิฟิก สัญลักษณ์ที่ขึ้นชื่อของเมืองซานฟรานซิสโกได้แก่ สะพานโกลเดนเกต และแหล่งท่องเที่ยวที่มีชื่อเสียงได้แก่ เกาะอัลคาทราซ รถรางซานฟรานซิสโก Pier 39 และ ถนนลอมบาร์ด ทีมกีฬา อเมริกันฟุตบอล ที่สำคัญได้แก่ ซานฟรานซิสโก 49ers เป็นเมืองเศรษฐกิจที่มีขนาดใหญ่ และชาวเอเชียอาศัยที่อ่าวซานฟรานซิโกเป็นจำนวนมาก\""(STRING_LITERAL_QUOTE)), production = :predicateObjectList
ERROR [line: 149] Expected one of ["a", :IRIREF, :PNAME_LN, :PNAME_NS] (found "\"San Francisco je četrto največje mesto v Kaliforniji; hkrati je tudi okrožje. Ocena prebivalcev iz leta 2004 je 744.230.\nSamo mesto leži na skrajnem delu polotoka San Francisco, hkrati pa zajema več otokov v zalivu San Francisca in ožini Golden Gate.\""(STRING_LITERAL_QUOTE)), production = :predicateObjectList
ERROR [line: 150] With input '"\uc0cc\ud504\ub780\uc2dc\uc2a4\ucf54\ub294 \ubbf8\uad6d \uce98\ub9ac\ud3ec\ub2c8\uc544 \uc8fc \uc911': Invalid token "\"\\uc0cc\\ud504\\ub780\\uc2dc\\uc2a4\\ucf54\\ub294" (found "\"\\uc0cc\\ud504\\ub780\\uc2dc\\uc2a4\\ucf54\\ub294"), production = :_triples_1
ERROR [line: 150] Expected one of ["a", :IRIREF, :PNAME_LN, :PNAME_NS] (found "\"Сан-Франциско — місто на західному узбережжі США у штаті Каліфорнія, порт, осередок індустрії й торгівлі, осередок багатьох дослідницьких інститутів, зокрема Каліфорнійського університету та Університету штату Каліфорнія, населення 805 000 мешканців, близько 3 000 українців.\""(STRING_LITERAL_QUOTE)), production = :predicateObjectList
ERROR [line: 151] With input '"\u820a\u91d1\u5c71\uff0c\u6b63\u5f0f\u540d\u7a31\u70ba\u820a\u91d1\u5c71\u5e02\u90e1\uff0c\u662f\u7f': Invalid token "\"\\u820a\\u91d1\\u5c71\\uff0c\\u6b63\\u5f0f\\u540d\\u7a31\\u70ba\\u820a\\u91d1\\u5c71\\u5e02\\u90e1\\uff0c\\u662f\\u7f" (found "\"\\u820a\\u91d1\\u5c71\\uff0c\\u6b63\\u5f0f\\u540d\\u7a31\\u70ba\\u820a\\u91d1\\u5c71\\u5e02\\u90e1\\uff0c\\u662f\\u7f8e\\u570b\\u52a0\\u5229\\u798f\\u5c3c\\u4e9e\\u5dde\\u5317\\u90e8\\u7684\\u4e00\\u5ea7\\u90fd\\u5e02\\uff0c\\u4e5f\\u662f\\u52a0\\u5dde\\u552f\\u4e00\\u5e02\\u90e1\\u5408\\u4e00\\u7684\\u884c\\u653f\\u5340\\uff0c\\u4e2d\\u6587\\u53c8\\u97f3\\u8b6f\\u70ba\\u4e09\\u85e9\\u5e02\\u548c\\u8056\\xb7\\u5f17\\u6717\\u897f\\u65af\\u79d1\\uff0c\\u4ea6\\u5225\\u540d\\u300c\\u91d1\\u9580\\u57ce\\u5e02\\u300d\\u3001\\u300c\\u7063\\u908a\\u4e4b\\u57ce\\u300d\\u3001\\u300c\\u9727\\u57ce\\u300d\\u7b49\\u3002\\u4f4d\\u65bc\\u820a\\u91d1\\u5c71\\u534a\\u5cf6\\u7684\\u5317\\u7aef\\uff0c\\u6771\\u81e8\\u820a\\u91d1\\u5c71\\u7063\\u3001\\u897f\\u81e8\\u592a\\u5e73\\u6d0b\\uff0c\\u4eba\\u53e3\\u7d0483\\u842c\\uff0c\\u70ba\\u52a0\\u5dde\\u7b2c\\u56db\\u5927\\u57ce\\u3002\\u5176\\u8207\\u5357\\u908a\\u7684\\u8056\\u99ac\\u5201\\u90e1\\u3001\\u5357\\u7063\\u7684\\u8056\\u8377\\u897f\\u8207\\u77fd\\u8c37\\u5730\\u5340\\u3001\\u6771\\u7063\\u7684\\u5967\\u514b\\u862d\\u8207\\u67cf\\u514b\\u840a\\u3001\\u4ee5\\u53ca\\u5317\\u908a\\u7684\\u99ac\\u6797\\u90e1\\u8207\\u7d0d\\u5e15\\u90e1\\u5408\\u7a31\\u70ba\\u820a\\u91d1\\u5c71\\u7063\\u5340\\u3002\\n\\u820a\\u91d1\\u5c71\\u662f\\u5317\\u52a0\\u5dde\\u8207\\u820a\\u91d1\\u5c71\\u7063\\u5340\\u7684\\u5546\\u696d\\u8207\\u6587\\u5316\\u767c\\u5c55\\u4e2d\\u5fc3\\uff0c\\u7576\\u5730\\u4f4f\\u6709\\u5f88\\u591a\\u85dd\\u8853\\u5bb6\\u3001\\u4f5c\\u5bb6\\u548c\\u6f14\\u54e1\\uff0c\\u572820\\u4e16\\u7d00\\u53ca21\\u4e16\\u7d00\\u521d\\u4e00\\u76f4\\u662f\\u7f8e\\u570b\\u563b\\u76ae\\u6587\\u5316\\u548c\\u8fd1\\u4ee3\\u81ea\\u7531\\u4e3b\\u7fa9\\u3001\\u9032\\u6b
65\\u4e3b\\u7fa9\\u7684\\u4e2d\\u5fc3\\u4e4b\\u4e00\\u3002\\u9019\\u500b\\u57ce\\u5e02\\u540c\\u6a23\\u4ee5\\u5176\\u773e\\u591a\\u7684\\u7db2\\u969b\\u7db2\\u8def\\u516c\\u53f8\\u800c\\u805e\\u540d\\uff0c\\u540c\\u6642\\u4e5f\\u6210\\u70ba\\u4e86\\u5ee3\\u5927\\u540c\\u6027\\u6200\\u8005\\u7684\\u805a\\u5c45\\u5730\\u3002\\u820a\\u91d1\\u5c71\\u4e5f\\u662f\\u53d7\\u6b61\\u8fce\\u7684\\u65c5\\u904a\\u76ee\\u7684\\u5730\\uff0c\\u8207\\u5176\\u6dbc\\u723d\\u7684\\u590f\\u5b63\\u3001\\u591a\\u9727\\u3001\\u7dbf\\u5ef6\\u7684\\u4e18\\u9675\\u5730\\u5f62\\u3001\\u6df7\\u5408\\u7684\\u5efa\\u7bc9\\u98a8\\u683c\\uff0c\\u548c\\u91d1\\u9580\\u5927\\u6a4b\\u3001\\u7e9c\\u8eca\\u3001\\u60e1\\u9b54\\u5cf6\\u76e3\\u7344\\u53ca\\u4e2d\\u570b\\u57ce\\u7b49\\u666f\\u9ede\\u805e\\u540d\\u3002\\u6b64\\u5916\\uff0c\\u820a\\u91d1\\u5c71\\u4e5f\\u662f\\u4e94\\u5927\\u4e3b\\u8981\\u9280\\u884c\\u548c\\u8a31\\u591a\\u5927\\u578b\\u516c\\u53f8\\u7684\\u7e3d\\u90e8\\u6240\\u5728\\uff0c\\u5305\\u62ec\\u84cb\\u749e\\u3001\\u592a\\u5e73\\u6d0b\\u74e6\\u96fb\\u516c\\u53f8\\u3001Yelp\\u3001Pinterest\\u3001Twitter\\u3001\\u512a\\u6b65\\u3001Mozilla\\u548cCraigslist\\u7b49\\u3002\"@zh-TW;"), production = :_triples_1
ERROR [line: 151] With input '"San Francisco er en amerikansk by i delstaten Californien. Byen er med sine 837.442 indbyggere Calif': Invalid token "\"San" (found "\"San"), production = :_triples_1
ERROR [line: 151] With input '"San Francisco is een stad in de Amerikaanse staat Californi\xeb en het hart van de San Francisco Bay': Invalid token "\"San" (found "\"San"), production = :_triples_1
ERROR [line: 151] undefined prefix "county"
ERROR [line: 151] With input 'officieel heet ze City and County of San Francisco.\nDe stad, die 805.235 inwoners telt, ligt op het ': Invalid token "officieel" (found "officieel"), production = :_triples_1
ERROR [line: 151] With input '"San Francisco, ofici\xe1ln\u011b: M\u011bsto a Okres San Francisco, je velk\xe9 m\u011bsto na z\xe1p': Invalid token "\"San" (found "\"San"), production = :_triples_1
ERROR [line: 151] With input 'nejlidnat\u011bj\u0161\xedm m\u011bstem st\xe1tu Kalifornie a 14. nejlidnat\u011bj\u0161\xedm m\u011b': Invalid token "nejlidnat\\u011bj\\u0161\\xedm" (found "nejlidnat\\u011bj\\u0161\\xedm"), production = :_turtleDoc_1

I also tried Python and C# RDF Turtle libraries; all of them complain about \x. I tried to fix it manually by replacing \x with \u00 in the string, but then the parser starts complaining about unescaped double quotes in long string literals.

The errors above are what I got using the official Google code example.

Is the Freebase RDF broken? Am I doing something wrong? What is the right way to handle Freebase RDF?

Thank you.


Answer 1:


As discussed in Jena parsing issue for freebase RDF dump (Jan 2014), the Freebase dumps aren't always legal Turtle/N3. In this case, you're grabbing the data from

  • https://www.googleapis.com/freebase/v1/rdf/m/0d6lp

When I try to parse that with Jena, I get this error:

09:24:01 ERROR riot                 :: [line: 131, col: 54] illegal escape sequence value: x (0x78)
Exception in thread "main" org.apache.jena.riot.RiotException: [line: 131, col: 54] illegal escape sequence value: x (0x78)

As you noted, the immediate issue is that Turtle strings can't contain \x escapes. Turtle supports a few different kinds of escapes (see § 6.4 Escape Sequences), and it looks like these ought to be of the form \uXXXX (or \UXXXXXXXX). Line 131 is:

    ns:common.topic.alias    "Cidade e Condado de S\xe3o Francisco"@pt;
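
Before editing anything, it can help to list every escape in the file that Turtle would reject. The following is a minimal Ruby sketch, not a full Turtle lexer: the helper name illegal_escapes is made up for illustration, and it treats every backslash on a line the same way, whether or not it sits inside a string literal.

```ruby
# Escapes that Turtle allows inside string literals:
# \t \b \n \r \f \" \' \\ plus \uXXXX and \UXXXXXXXX.
LEGAL_ESCAPE = /\A\\(?:[tbnrf"'\\]|u\h{4}|U\h{8})/

# Hypothetical helper: returns [line_number, column, offending_text] for
# every backslash that does not start a legal escape. Columns are 1-based.
def illegal_escapes(text)
  problems = []
  text.each_line.with_index(1) do |line, lineno|
    pos = 0
    while (idx = line.index('\\', pos))
      rest = line[idx..-1]
      if (m = LEGAL_ESCAPE.match(rest))
        pos = idx + m[0].length # skip the whole legal escape
      else
        problems << [lineno, idx + 1, rest[0, 2]]
        pos = idx + 1
      end
    end
  end
  problems
end

# Single quotes keep the backslash literal, as it appears in the file.
sample = 'ns:common.topic.alias    "Cidade e Condado de S\xe3o Francisco"@pt;'
illegal_escapes(sample).each do |lineno, col, text|
  puts "line #{lineno}, col #{col}: illegal escape #{text}"
end
```

Running this over the downloaded file (for example with File.read) should give a quick picture of how many lines need repair before any parser is involved.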

We can fix it by replacing the \x with \u00, so we end up with \u00e3. Sure enough, we can parse this as a separate file:

[] <ns:common.topic.alias> "Cidade e Condado de S\u00e3o Francisco"@pt .

<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:j.0="ns:">
  <rdf:Description>
    <j.0:common.topic.alias xml:lang="pt">Cidade e Condado de São Francisco</j.0:common.topic.alias>
  </rdf:Description>
</rdf:RDF>

You can try globally replacing \x with \u00, but that won't fix all the problems in the file. After that, I end up with

09:34:27 ERROR riot                 :: [line: 170, col: 653] Unknown char: \(92;0x005C)
Exception in thread "main" org.apache.jena.riot.RiotException: [line: 170, col: 653] Unknown char: \(92;0x005C)

That's not a particularly helpful error, but I think what's going on here is that line 170 looks like this (where I've replaced the runs of legal \uXXXX escapes with …):

ns:common.topic.description     "…"\u05e8. …"@iw;

I'd guess that the second quotation mark should be escaped, but since it's not, it's seen as the end of the string. That means that the next character read is the \ from \u05e8, and \ doesn't make sense in that location (a comma, semicolon, at sign, circumflex, or dot would make sense).
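
To make the failure concrete, here is a small Ruby sketch of how a parser decides where a short string literal ends. The helper name closing_quote_index is made up for illustration, and the sample line uses abc and def as stand-ins for the elided text; a real Turtle lexer does considerably more than this.

```ruby
# Hypothetical helper: index of the quote that closes a short Turtle string
# literal whose opening quote is at index `start`, or nil if the line ends
# first. A backslash escapes the next character, so \" does not close it.
def closing_quote_index(line, start)
  i = start + 1
  while i < line.length
    case line[i]
    when '\\' then i += 2 # skip the escaped character
    when '"'  then return i
    else i += 1
    end
  end
  nil
end

# Stand-in for line 170: "abc" and "def" replace the elided text.
line = 'ns:common.topic.description "abc"\u05e8. def"@iw;'
open_at = line.index('"')
close_at = closing_quote_index(line, open_at)
puts "literal:  #{line[open_at..close_at]}"
puts "leftover: #{line[(close_at + 1)..-1]}"
```

The leftover text begins with a backslash, which would be consistent with Jena's "Unknown char: \" complaint.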

I finally got a version of this that I could parse after performing a few transformations, but these are obviously a bit ad hoc.

  1. Replace all \x with \u00.
  2. Since it appears that there's just one string per line, replace the first " on a line with """, and replace the last " on a line with """. This means that " ... " ... " gets turned into """ ... " ... """, which is legal.
  3. There are dollar signs in a bunch of the names, and offhand I think that's illegal. I replaced them with DOLLAR. This isn't good, though, because in some places $, like \x, should be replaced by \u. E.g.:

    key:wikipedia.ca    "San_Francisco_$0028Calif$00F2rnia$0029";
    ns:type.object.key    ns:wikipedia.fr.San_Francisco_$0028Californie$0029;
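
For the quoted key on the first of these lines, a less lossy alternative to DOLLAR might look like the Ruby sketch below. It rests on an assumption that the answer only hints at: that $ followed by four hex digits stands for the character with that code point. Both helper names are made up for illustration, and the sketch deliberately leaves prefixed names such as the ns:wikipedia.fr... one alone, since \u escapes are not valid there.

```ruby
# Assumption: "$" followed by four hex digits encodes one character by its
# code point, e.g. $0028 for "(". Both helpers are hypothetical.

# Rewrite $XXXX as a Turtle \uXXXX escape inside a string literal.
def dollar_to_unicode_escape(literal)
  literal.gsub(/\$(\h{4})/) { "\\u#{Regexp.last_match(1)}" }
end

# Decode $XXXX directly into the character it stands for.
def decode_dollar_escapes(literal)
  literal.gsub(/\$(\h{4})/) { [Regexp.last_match(1).hex].pack('U') }
end

key = '"San_Francisco_$0028Calif$00F2rnia$0029"'
puts dollar_to_unicode_escape(key)
puts decode_dollar_escapes(key)
```

Using the block form of gsub keeps the backslash in the replacement literal, so there is no interaction with gsub's own backreference syntax.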
    

So, the result isn't good, but it can be parsed. I generated it with sed:

sed -r -e 's/\\x/\\u00/g ; s/^([^"]*)"/\1"""/ ; s/"([^"]*)$/"""\1/ ; s/[$]/DOLLAR/g' 0d6lp.ttl
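
Since the question uses Ruby, the same four substitutions can be sketched there as well. This is a rough port of the sed command above under the same one-string-per-line assumption; the names clean_line and clean_file are made up for illustration, and the result is just as ad hoc as the sed version.

```ruby
# Hypothetical port of the sed command: apply the four substitutions to one
# line (without its trailing newline).
def clean_line(line)
  line
    .gsub('\\x') { '\\u00' }       # \xHH -> \u00HH
    .sub(/\A([^"]*)"/, '\1"""')    # first " on the line -> """
    .sub(/"([^"]*)\z/, '"""\1')    # last " on the line  -> """
    .gsub('$') { 'DOLLAR' }        # crude: every $ -> DOLLAR
end

# Hypothetical wrapper: clean a whole file line by line.
def clean_file(input_path, output_path)
  File.open(output_path, 'w') do |out|
    File.foreach(input_path) { |line| out.puts(clean_line(line.chomp)) }
  end
end

samples = [
  'ns:common.topic.alias    "Cidade e Condado de S\xe3o Francisco"@pt;',
  'ns:type.object.key    ns:wikipedia.fr.San_Francisco_$0028Californie$0029;'
]
samples.each { |s| puts clean_line(s) }
```

With the response saved locally as 0d6lp.ttl, as in the sed example, calling clean_file('0d6lp.ttl', 'cleaned.ttl') would write a cleaned copy that can then be handed to RDF::Turtle::Reader.open; the output name cleaned.ttl is only a placeholder.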



Answer 2:


If you are interested in Freebase in RDF form, you may be better off using a pre-cleaned version like BaseKB, which is specifically processed to be legal RDF and to remove various junk data:

:BaseKB is an RDF knowledge base derived from Freebase, a major source of the Google Knowledge Graph; :BaseKB contains about half as many facts as the Freebase dump because it removes trivial, ill-formed and repetitive facts that make processing difficult.



Source: https://stackoverflow.com/questions/26779977/how-to-read-freebase-rdf-data-it-seems-to-be-a-bit-broken
