wikipedia

Finding and downloading images within the Wikipedia Dump

主宰稳场 · submitted on 2019-12-13 11:36:56
Question: I'm trying to find a comprehensive list of all images on Wikipedia, which I can then filter down to the public-domain ones. I've downloaded the SQL dumps from here: http://dumps.wikimedia.org/enwiki/latest/ and studied the DB schema: http://upload.wikimedia.org/wikipedia/commons/thumb/4/42/MediaWiki_1.20_%2844edaa2%29_database_schema.svg/2193px-MediaWiki_1.20_%2844edaa2%29_database_schema.svg.png I think I understand it, but when I pick a sample image from a Wikipedia page I can't find it …
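A minimal sketch of one way to start, assuming the enwiki image.sql dump has been imported into a local MySQL database (the database name and credentials below are placeholders). Note that licensing information lives in the wikitext and categories of the file description pages, so filtering down to public domain is a separate step:

```python
import pymysql

# Assumption: the enwiki image.sql dump was imported into a local
# database named "wp_dump"; host/user/password are placeholders.
conn = pymysql.connect(host="localhost", user="wiki", password="secret",
                       database="wp_dump")
with conn.cursor() as cur:
    # The `image` table only lists files uploaded locally to enwiki.
    # Most article images are hosted on Wikimedia Commons and appear in
    # the *Commons* dump instead, which is the usual reason a sample
    # image picked from an article is missing here.
    cur.execute("SELECT img_name, img_size FROM image LIMIT 10")
    for name, size in cur.fetchall():
        print(name.decode("utf-8"), size)
conn.close()
```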

How can I extract specific links in Wikipedia articles using jsoup?

自古美人都是妖i · submitted on 2019-12-13 09:38:40
Question: I am doing an NLP project and I need to know how to extract links that appear only in the "introduction" section and in the "geography" section of this Wikipedia page: http://en.wikipedia.org/wiki/Boston Could you please help me? Answer 1: Wikipedia does not make this easy. I don't claim this to be elegant or even very reusable. Document doc = Jsoup.connect("http://en.wikipedia.org/wiki/Boston").timeout(5000).get(); Element intro = doc.body().select("p").first(); while (intro.tagName().equals("p")) …
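A minimal sketch of an alternative to walking sibling tags with jsoup: the MediaWiki parse API can list an article's sections and then return only the links of a given section (section 0 is the lead/introduction):

```python
import requests

API = "https://en.wikipedia.org/w/api.php"

# Find the index of the "Geography" section from the section list.
sections = requests.get(API, params={
    "action": "parse", "page": "Boston",
    "prop": "sections", "format": "json",
}).json()["parse"]["sections"]
geo = next(s["index"] for s in sections if s["line"] == "Geography")

for section in ("0", geo):  # "0" = the introduction
    links = requests.get(API, params={
        "action": "parse", "page": "Boston", "section": section,
        "prop": "links", "format": "json",
    }).json()["parse"]["links"]
    # ns == 0 keeps only links to ordinary articles.
    print([l["*"] for l in links if l["ns"] == 0])
```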

How to get associated (English) Wikipedia page from Wikidata page / Q number using Wikidata dump?

我的未来我决定 · submitted on 2019-12-13 04:03:42
Question: For @en text alone, a single item from the Wikidata dump contains multiple names:

<http://www.wikidata.org/entity/Q26> <http://www.w3.org/2000/01/rdf-schema#label> "Northern Ireland"@en .
<http://www.wikidata.org/entity/Q26> <http://www.w3.org/2004/02/skos/core#prefLabel> "Northern Ireland"@en .
<http://www.wikidata.org/entity/Q26> <http://schema.org/name> "Northern Ireland"@en .

On the Wikidata page for this article (http://www.wikidata.org/entity/Q26), which of these (if any) corresponds to …
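For what it's worth, none of those three label predicates is guaranteed to match the English Wikipedia page title; the enwiki *sitelink* is what maps a Q number to its article (in the RDF dump the same fact appears as a schema:about triple on the en.wikipedia.org page URL). A minimal sketch using the live wbgetentities API instead of the dump:

```python
import requests

r = requests.get("https://www.wikidata.org/w/api.php", params={
    "action": "wbgetentities", "ids": "Q26",
    "props": "sitelinks", "sitefilter": "enwiki", "format": "json",
})
# The enwiki sitelink carries the English Wikipedia page title.
print(r.json()["entities"]["Q26"]["sitelinks"]["enwiki"]["title"])
# -> "Northern Ireland"
```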

How to get the result of "all pages with prefix" using the Wikipedia API?

我们两清 · submitted on 2019-12-12 18:03:25
Question: I wish to use the Wikipedia API to extract the results of this page: http://en.wikipedia.org/wiki/Special:PrefixIndex when searching for "something" on it, for example this: http://en.wikipedia.org/w/index.php?title=Special%3APrefixIndex&prefix=tal&namespace=4 Then, I would like to access each of the resulting pages and extract their information. What API call might I use? Answer 1: You can use list=allpages and specify apprefix. For example: http://en.wikipedia.org/w/api.php?format=xml&action=query&list…
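A minimal sketch of the answer's approach, paging through the results with the API's continuation mechanism (namespace 4 is the "Wikipedia:" project namespace, matching the Special:PrefixIndex example above):

```python
import requests

API = "https://en.wikipedia.org/w/api.php"
params = {
    "action": "query", "list": "allpages",
    "apprefix": "tal", "apnamespace": 4,
    "aplimit": 500, "format": "json",
}
while True:
    data = requests.get(API, params=params).json()
    for page in data["query"]["allpages"]:
        print(page["title"])
    if "continue" not in data:
        break
    params.update(data["continue"])  # carry apcontinue into the next request
```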

How to scrape the subcategories and pages of a Wikipedia category page using Python

血红的双手。 · submitted on 2019-12-12 15:38:18
Question: So I'm trying to scrape all the subcategories and pages under the category header of the Category page "Category: Class-based programming languages", found at: https://en.wikipedia.org/wiki/Category:Class-based_programming_languages I've figured out a way to do this using URLs and the MediaWiki API Categorymembers. The way to do that would be: en.wikipedia.org/w/api.php?action=query&list=categorymembers&cmtitle=Category:Class-based%20programming%20languages&format=json&cmlimit=500 …
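A minimal sketch of that categorymembers call in Python, separating subcategories from pages and following cmcontinue for result sets larger than one batch:

```python
import requests

API = "https://en.wikipedia.org/w/api.php"
params = {
    "action": "query", "list": "categorymembers",
    "cmtitle": "Category:Class-based programming languages",
    "cmlimit": 500, "format": "json",
}
subcats, pages = [], []
while True:
    data = requests.get(API, params=params).json()
    for m in data["query"]["categorymembers"]:
        # Namespace 14 is the Category: namespace; everything else
        # returned here is an ordinary article page.
        (subcats if m["ns"] == 14 else pages).append(m["title"])
    if "continue" not in data:
        break
    params.update(data["continue"])
print(len(subcats), "subcategories,", len(pages), "pages")
```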

Issues with Wikipedia dump table pagelinks

烂漫一生 · submitted on 2019-12-12 13:32:04
Question: I downloaded the enwiki-latest-pagelinks.sql.gz dump from dumps.wikimedia.org/enwiki/latest/ . I unpacked the file; its uncompressed size is 37 GB. The table structure is this:

SHOW CREATE TABLE wp_dump.pagelinks;

CREATE TABLE `pagelinks` (
  `pl_from` int(8) unsigned NOT NULL DEFAULT '0',
  `pl_namespace` int(11) NOT NULL DEFAULT '0',
  `pl_title` varbinary(255) NOT NULL DEFAULT '',
  `pl_from_namespace` int(11) NOT NULL DEFAULT '0',
  UNIQUE KEY `pl_from` (`pl_from`,`pl_namespace`,`pl_title`),
  KEY `pl…
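A minimal sketch of a typical query against that schema, assuming both the pagelinks and page dumps have been imported into the wp_dump database named in the question (credentials are placeholders). Because pl_from stores the page_id of the linking page, a join on the page table is needed to resolve it to a title:

```python
import pymysql

conn = pymysql.connect(host="localhost", user="wiki", password="secret",
                       database="wp_dump")
with conn.cursor() as cur:
    # pl_title holds the link *target*; pl_from is the page_id of the
    # linking page, hence the join to recover the source title.
    cur.execute("""
        SELECT p.page_title
        FROM pagelinks pl
        JOIN page p ON p.page_id = pl.pl_from
        WHERE pl.pl_namespace = 0 AND pl.pl_title = %s
        LIMIT 20
    """, ("Boston",))
    for (title,) in cur.fetchall():
        print(title.decode("utf-8"))
conn.close()
```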

Extract unidentified HTML content from between two tags, using jsoup? Regex?

不想你离开。 · submitted on 2019-12-12 12:02:18
Question: I want to get the names of all the links from between the two h2 tags there:

<h2><span class="mw-headline" id="People">People</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=Bush&action=edit&section=1" title="Edit section: People">edit</a><span class="mw-editsection-bracket">]</span></span></h2>
<ul>
<li><a href="/wiki/George_H._W._Bush" title="George H. W. Bush">George H. W. Bush</a> (born 1924), the 41st president of the United …
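A minimal sketch in Python with BeautifulSoup rather than jsoup, assuming the markup shown above (bare <h2> section headings) and that the page is /wiki/Bush, as inferred from the edit link in the snippet: collect link texts from everything between the "People" heading and the next <h2>.

```python
import requests
from bs4 import BeautifulSoup

html = requests.get("https://en.wikipedia.org/wiki/Bush").text
soup = BeautifulSoup(html, "html.parser")

# Locate the <h2> that wraps the "People" headline span.
heading = soup.find("span", id="People").find_parent("h2")
names = []
for sib in heading.find_next_siblings():
    if sib.name == "h2":          # reached the next section heading: stop
        break
    for a in sib.find_all("a", href=True):
        if a["href"].startswith("/wiki/"):   # skip edit links etc.
            names.append(a.get_text())
print(names)
```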

Using SPARQL to query DBpedia company information

生来就可爱ヽ(ⅴ<●) · submitted on 2019-12-12 10:06:25
Question: I'm trying to query DBpedia using SPARQL to find company information such as a description and a logo. I'm rather lost devising the SPARQL query to do this.

SELECT DISTINCT ?subject ?employees ?homepage WHERE {
  ?subject rdf:type <http://dbpedia.org/class/yago/Company108058098> .
  ?subject dbpedia2:numEmployees ?employees
    FILTER ( xsd:integer(?employees) >= 50000 ) .
  ?subject foaf:homepage ?homepage .
}
ORDER BY DESC(xsd:integer(?employees))
LIMIT 20

I have come across the above …
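A minimal sketch that extends the query above with a description (dbo:abstract) and a thumbnail image (dbo:thumbnail), sent to the public DBpedia endpoint. The predicate choices are assumptions: DBpedia has no single canonical "logo" property, and foaf:depiction is another common candidate.

```python
import requests

query = """
SELECT DISTINCT ?subject ?abstract ?thumb WHERE {
  ?subject rdf:type <http://dbpedia.org/class/yago/Company108058098> .
  ?subject dbo:abstract ?abstract FILTER ( lang(?abstract) = "en" ) .
  OPTIONAL { ?subject dbo:thumbnail ?thumb }
}
LIMIT 20
"""
# The dbo:, rdf:, and foaf: prefixes are predefined on DBpedia's endpoint.
r = requests.get("https://dbpedia.org/sparql", params={
    "query": query, "format": "application/sparql-results+json",
})
for row in r.json()["results"]["bindings"]:
    print(row["subject"]["value"], row.get("thumb", {}).get("value"))
```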

Easy way to export Wikipedia's translated titles

泄露秘密 · submitted on 2019-12-12 09:53:17
Question: Is there an easy way to export Wikipedia's translated titles to get a set like this: russian_title -> english_title? I tried to get them from ruwiki-latest-pages-meta-current.xml.bz2 and ruwiki-latest-pages-articles.xml.bz2; however, there are fewer than 25k translations, and I found out some are not present. E.g. one can see a link to the English wiki here, but there is no [[en:Yandex]] link in the dump. Maybe I should try to parse the English Wikipedia, but I'm sure there is a nicer solution. BTW, I'm …
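Inline [[en:...]] links are incomplete because interlanguage links have largely moved to Wikidata; the langlinks API (or, for bulk work, the ruwiki-latest-langlinks.sql.gz dump table) carries the full mapping regardless of where the link is stored. A minimal sketch for one title:

```python
import requests

r = requests.get("https://ru.wikipedia.org/w/api.php", params={
    "action": "query", "prop": "langlinks", "titles": "Яндекс",
    "lllang": "en", "lllimit": 500, "format": "json",
})
for page in r.json()["query"]["pages"].values():
    for ll in page.get("langlinks", []):
        print(page["title"], "->", ll["*"])   # Яндекс -> Yandex
```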

How to get coordinates from a Wikipedia page through the API?

馋奶兔 · submitted on 2019-12-12 09:43:28
Question: I want to get the coordinates of a Wikipedia page through their API, passing the page title as the 'titles' parameter. I have searched SO for a solution, but it seems the answers there scrape the page and then extract the coordinates. Is it possible through their API? Answer 1: You need to use the Wikipedia API. For your example with Kinkaku-ji the query will be: https://en.wikipedia.org/w/api.php?action=query&prop=coordinates&titles=Kinkaku-ji For more than one title, use a pipe to separate them: titles=Kinkaku-ji|Paris|...
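A minimal sketch wrapping that query with requests (prop=coordinates is provided by the GeoData extension):

```python
import requests

r = requests.get("https://en.wikipedia.org/w/api.php", params={
    "action": "query", "prop": "coordinates",
    "titles": "Kinkaku-ji|Paris", "format": "json",
})
for page in r.json()["query"]["pages"].values():
    for c in page.get("coordinates", []):   # pages without coords have none
        print(page["title"], c["lat"], c["lon"])
```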