wikipedia

Finding and downloading images within the Wikipedia Dump

主宰稳场 · submitted on 2019-12-13 11:36:56
Question: I'm trying to find a comprehensive list of all images on Wikipedia, which I can then filter down to the public-domain ones. I've downloaded the SQL dumps from here: http://dumps.wikimedia.org/enwiki/latest/ and studied the DB schema: http://upload.wikimedia.org/wikipedia/commons/thumb/4/42/MediaWiki_1.20_%2844edaa2%29_database_schema.svg/2193px-MediaWiki_1.20_%2844edaa2%29_database_schema.svg.png I think I understand it, but when I pick a sample image from a Wikipedia page I can't find it …
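A minimal sketch of one way to start, assuming the enwiki image.sql dump has been imported into a local MySQL database (the database name and credentials below are placeholders). Note that licensing information lives in the wikitext and categories of the file description pages, so filtering down to public domain is a separate step:

```python
import pymysql

# Assumption: the enwiki image.sql dump was imported into a local
# database named "wp_dump"; host/user/password are placeholders.
conn = pymysql.connect(host="localhost", user="wiki", password="secret",
                       database="wp_dump")
with conn.cursor() as cur:
    # The `image` table only lists files uploaded locally to enwiki.
    # Most article images are hosted on Wikimedia Commons and appear in
    # the *Commons* dump instead, which is the usual reason a sample
    # image picked from an article is missing here.
    cur.execute("SELECT img_name, img_size FROM image LIMIT 10")
    for name, size in cur.fetchall():
        print(name.decode("utf-8"), size)
conn.close()
```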

How can I extract specific links in Wikipedia articles using jsoup?

自古美人都是妖i · submitted on 2019-12-13 09:38:40
Question: I am doing an NLP project and I need to know how to extract links that appear only in the "introduction" section and in the "geography" section of this Wikipedia page: http://en.wikipedia.org/wiki/Boston Could you please help me? Answer 1: Wikipedia does not make this easy. I don't claim this to be elegant or even very reusable. Document doc = Jsoup.connect("http://en.wikipedia.org/wiki/Boston").timeout(5000).get(); Element intro = doc.body().select("p").first(); while (intro.tagName().equals("p")) …
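A minimal sketch of an alternative to walking sibling tags with jsoup: the MediaWiki parse API can list an article's sections and then return only the links of a given section (section 0 is the lead/introduction):

```python
import requests

API = "https://en.wikipedia.org/w/api.php"

# Find the index of the "Geography" section from the section list.
sections = requests.get(API, params={
    "action": "parse", "page": "Boston",
    "prop": "sections", "format": "json",
}).json()["parse"]["sections"]
geo = next(s["index"] for s in sections if s["line"] == "Geography")

for section in ("0", geo):  # "0" = the introduction
    links = requests.get(API, params={
        "action": "parse", "page": "Boston", "section": section,
        "prop": "links", "format": "json",
    }).json()["parse"]["links"]
    # ns == 0 keeps only links to ordinary articles.
    print([l["*"] for l in links if l["ns"] == 0])
```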

How to get associated (English) Wikipedia page from Wikidata page / Q number using Wikidata dump?

我的未来我决定 · submitted on 2019-12-13 04:03:42
Question: For @en text alone, a single item from the Wikidata dump contains multiple names:

<http://www.wikidata.org/entity/Q26> <http://www.w3.org/2000/01/rdf-schema#label> "Northern Ireland"@en .
<http://www.wikidata.org/entity/Q26> <http://www.w3.org/2004/02/skos/core#prefLabel> "Northern Ireland"@en .
<http://www.wikidata.org/entity/Q26> <http://schema.org/name> "Northern Ireland"@en .

On the Wikidata page for this article (http://www.wikidata.org/entity/Q26), which of these (if any) corresponds to …
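For what it's worth, none of those three label predicates is guaranteed to match the English Wikipedia page title; the enwiki *sitelink* is what maps a Q number to its article (in the RDF dump the same fact appears as a schema:about triple on the en.wikipedia.org page URL). A minimal sketch using the live wbgetentities API instead of the dump:

```python
import requests

r = requests.get("https://www.wikidata.org/w/api.php", params={
    "action": "wbgetentities", "ids": "Q26",
    "props": "sitelinks", "sitefilter": "enwiki", "format": "json",
})
# The enwiki sitelink carries the English Wikipedia page title.
print(r.json()["entities"]["Q26"]["sitelinks"]["enwiki"]["title"])
# -> "Northern Ireland"
```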

How to get the result of "all pages with prefix" using the Wikipedia API?

我们两清 · submitted on 2019-12-12 18:03:25
Question: I wish to use the Wikipedia API to extract the results of this page: http://en.wikipedia.org/wiki/Special:PrefixIndex when searching for "something" on it, for example this: http://en.wikipedia.org/w/index.php?title=Special%3APrefixIndex&prefix=tal&namespace=4 Then, I would like to access each of the resulting pages and extract their information. What API call might I use? Answer 1: You can use list=allpages and specify apprefix. For example: http://en.wikipedia.org/w/api.php?format=xml&action=query&list…
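A minimal sketch of the answer's approach, paging through the results with the API's continuation mechanism (namespace 4 is the "Wikipedia:" project namespace, matching the Special:PrefixIndex example above):

```python
import requests

API = "https://en.wikipedia.org/w/api.php"
params = {
    "action": "query", "list": "allpages",
    "apprefix": "tal", "apnamespace": 4,
    "aplimit": 500, "format": "json",
}
while True:
    data = requests.get(API, params=params).json()
    for page in data["query"]["allpages"]:
        print(page["title"])
    if "continue" not in data:
        break
    params.update(data["continue"])  # carry apcontinue into the next request
```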

How to scrape the subcategories and pages of a Wikipedia category page using Python

血红的双手。 · submitted on 2019-12-12 15:38:18
Question: So I'm trying to scrape all the subcategories and pages under the category header of the Category page "Category: Class-based programming languages", found at: https://en.wikipedia.org/wiki/Category:Class-based_programming_languages I've figured out a way to do this using URLs and the MediaWiki API Categorymembers. The way to do that would be: en.wikipedia.org/w/api.php?action=query&list=categorymembers&cmtitle=Category:Class-based%20programming%20languages&format=json&cmlimit=500 …
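A minimal sketch of that categorymembers call in Python, separating subcategories from pages and following cmcontinue for result sets larger than one batch:

```python
import requests

API = "https://en.wikipedia.org/w/api.php"
params = {
    "action": "query", "list": "categorymembers",
    "cmtitle": "Category:Class-based programming languages",
    "cmlimit": 500, "format": "json",
}
subcats, pages = [], []
while True:
    data = requests.get(API, params=params).json()
    for m in data["query"]["categorymembers"]:
        # Namespace 14 is the Category: namespace; everything else
        # returned here is an ordinary article page.
        (subcats if m["ns"] == 14 else pages).append(m["title"])
    if "continue" not in data:
        break
    params.update(data["continue"])
print(len(subcats), "subcategories,", len(pages), "pages")
```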

Issues with Wikipedia dump table pagelinks

烂漫一生 · submitted on 2019-12-12 13:32:04
Question: I downloaded the enwiki-latest-pagelinks.sql.gz dump from dumps.wikimedia.org/enwiki/latest/ . I unpacked the file; its uncompressed size is 37 GB. The table structure is this:

SHOW CREATE TABLE wp_dump.pagelinks;

CREATE TABLE `pagelinks` (
  `pl_from` int(8) unsigned NOT NULL DEFAULT '0',
  `pl_namespace` int(11) NOT NULL DEFAULT '0',
  `pl_title` varbinary(255) NOT NULL DEFAULT '',
  `pl_from_namespace` int(11) NOT NULL DEFAULT '0',
  UNIQUE KEY `pl_from` (`pl_from`,`pl_namespace`,`pl_title`),
  KEY `pl…
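A minimal sketch of a typical query against that schema, assuming both the pagelinks and page dumps have been imported into the wp_dump database named in the question (credentials are placeholders). Because pl_from stores the page_id of the linking page, a join on the page table is needed to resolve it to a title:

```python
import pymysql

conn = pymysql.connect(host="localhost", user="wiki", password="secret",
                       database="wp_dump")
with conn.cursor() as cur:
    # pl_title holds the link *target*; pl_from is the page_id of the
    # linking page, hence the join to recover the source title.
    cur.execute("""
        SELECT p.page_title
        FROM pagelinks pl
        JOIN page p ON p.page_id = pl.pl_from
        WHERE pl.pl_namespace = 0 AND pl.pl_title = %s
        LIMIT 20
    """, ("Boston",))
    for (title,) in cur.fetchall():
        print(title.decode("utf-8"))
conn.close()
```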

Extract unidentified HTML content from between two tags, using jsoup? Regex?

不想你离开。 · submitted on 2019-12-12 12:02:18
Question: I want to get the names of all the links from between the two h2 tags there:

<h2><span class="mw-headline" id="People">People</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=Bush&action=edit&section=1" title="Edit section: People">edit</a><span class="mw-editsection-bracket">]</span></span></h2>
<ul>
<li><a href="/wiki/George_H._W._Bush" title="George H. W. Bush">George H. W. Bush</a> (born 1924), the 41st president of the United …
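A minimal sketch in Python with BeautifulSoup rather than jsoup, assuming the markup shown above (bare <h2> section headings) and that the page is /wiki/Bush, as inferred from the edit link in the snippet: collect link texts from everything between the "People" heading and the next <h2>.

```python
import requests
from bs4 import BeautifulSoup

html = requests.get("https://en.wikipedia.org/wiki/Bush").text
soup = BeautifulSoup(html, "html.parser")

# Locate the <h2> that wraps the "People" headline span.
heading = soup.find("span", id="People").find_parent("h2")
names = []
for sib in heading.find_next_siblings():
    if sib.name == "h2":          # reached the next section heading: stop
        break
    for a in sib.find_all("a", href=True):
        if a["href"].startswith("/wiki/"):   # skip edit links etc.
            names.append(a.get_text())
print(names)
```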

Using SPARQL to query DBpedia company information

生来就可爱ヽ(ⅴ<●) · submitted on 2019-12-12 10:06:25
Question: I'm trying to query DBpedia using SPARQL to find company information such as a description and a logo. I'm rather lost devising the SPARQL query to do this.

SELECT DISTINCT ?subject ?employees ?homepage WHERE {
  ?subject rdf:type <http://dbpedia.org/class/yago/Company108058098> .
  ?subject dbpedia2:numEmployees ?employees
    FILTER ( xsd:integer(?employees) >= 50000 ) .
  ?subject foaf:homepage ?homepage .
}
ORDER BY DESC(xsd:integer(?employees))
LIMIT 20

I have come across the above …
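A minimal sketch that extends the query above with a description (dbo:abstract) and a thumbnail image (dbo:thumbnail), sent to the public DBpedia endpoint. The predicate choices are assumptions: DBpedia has no single canonical "logo" property, and foaf:depiction is another common candidate.

```python
import requests

query = """
SELECT DISTINCT ?subject ?abstract ?thumb WHERE {
  ?subject rdf:type <http://dbpedia.org/class/yago/Company108058098> .
  ?subject dbo:abstract ?abstract FILTER ( lang(?abstract) = "en" ) .
  OPTIONAL { ?subject dbo:thumbnail ?thumb }
}
LIMIT 20
"""
# The dbo:, rdf:, and foaf: prefixes are predefined on DBpedia's endpoint.
r = requests.get("https://dbpedia.org/sparql", params={
    "query": query, "format": "application/sparql-results+json",
})
for row in r.json()["results"]["bindings"]:
    print(row["subject"]["value"], row.get("thumb", {}).get("value"))
```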

Easy way to export Wikipedia's translated titles

泄露秘密 · submitted on 2019-12-12 09:53:17
Question: Is there an easy way to export Wikipedia's translated titles to get a set like this: russian_title -> english_title? I tried to get them from ruwiki-latest-pages-meta-current.xml.bz2 and ruwiki-latest-pages-articles.xml.bz2; however, there are fewer than 25k translations, and I found out some are not present. E.g. one can see a link to the English wiki here, but there is no [[en:Yandex]] link in the dump. Maybe I should try to parse the English Wikipedia, but I'm sure there is a nicer solution. BTW, I'm …
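Inline [[en:...]] links are incomplete because interlanguage links have largely moved to Wikidata; the langlinks API (or, for bulk work, the ruwiki-latest-langlinks.sql.gz dump table) carries the full mapping regardless of where the link is stored. A minimal sketch for one title:

```python
import requests

r = requests.get("https://ru.wikipedia.org/w/api.php", params={
    "action": "query", "prop": "langlinks", "titles": "Яндекс",
    "lllang": "en", "lllimit": 500, "format": "json",
})
for page in r.json()["query"]["pages"].values():
    for ll in page.get("langlinks", []):
        print(page["title"], "->", ll["*"])   # Яндекс -> Yandex
```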

How to get coordinates from a Wikipedia page through the API?

馋奶兔 · submitted on 2019-12-12 09:43:28
Question: I want to get the coordinates of a Wikipedia page through their API, passing the page title as the 'titles' parameter. I have searched SO for a solution, but it seems the answers there scrape the page and then extract the coordinates. Is it possible through their API? Answer 1: You need to use the Wikipedia API. For your example with Kinkaku-ji the query will be: https://en.wikipedia.org/w/api.php?action=query&prop=coordinates&titles=Kinkaku-ji For more than one title, use a pipe to separate them: titles=Kinkaku-ji|Paris|...
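A minimal sketch wrapping that query with requests (prop=coordinates is provided by the GeoData extension):

```python
import requests

r = requests.get("https://en.wikipedia.org/w/api.php", params={
    "action": "query", "prop": "coordinates",
    "titles": "Kinkaku-ji|Paris", "format": "json",
})
for page in r.json()["query"]["pages"].values():
    for c in page.get("coordinates", []):   # pages without coords have none
        print(page["title"], c["lat"], c["lon"])
```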