wikipedia

How to obtain a list of titles of all Wikipedia articles

爱⌒轻易说出口 submitted on 2020-05-09 19:31:12
Question: I'd like to obtain a list of the titles of all Wikipedia articles. I know there are two possible ways to get content from a Wikimedia-powered wiki. One would be the API and the other would be a database dump. I'd prefer not to download the wiki dump. First, it's huge, and second, I'm not really experienced with querying databases. The problem with the API, on the other hand, is that I couldn't figure out a way to only retrieve a list of the article titles, and even if it would need > 4
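
A hedged sketch of the API route: the list=allpages module pages through every title in the article namespace, up to 500 at a time, following the continue token. (For the full set in one shot, the dumps site also publishes a plain all-titles-in-ns0 file, which avoids the database entirely.) The request cap below is only so the example terminates quickly.

```python
import requests

API = "https://en.wikipedia.org/w/api.php"

def iter_titles(max_requests=3):
    # list=allpages walks the article namespace alphabetically;
    # each response carries a continue token for the next slice.
    params = {
        "action": "query",
        "list": "allpages",
        "apnamespace": 0,     # main/article namespace only
        "aplimit": "max",     # up to 500 titles per request
        "format": "json",
    }
    for _ in range(max_requests):
        data = requests.get(API, params=params).json()
        for page in data["query"]["allpages"]:
            yield page["title"]
        cont = data.get("continue")
        if not cont:
            break             # no token means we reached the end
        params.update(cont)   # carry apcontinue into the next request

for title in iter_titles():
    print(title)
```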

get wikipedia id-page using wikidata id

为君一笑 submitted on 2020-04-17 21:13:40
Question: Using the SPARQL query below, I successfully retrieved some soccer players' information; I then tried to retrieve the Wikipedia page-id from the Wikidata item-id, but it returns an error (java.util.concurrent.TimeoutException): PREFIX wd: <http://www.wikidata.org/entity/> PREFIX owl: <http://www.w3.org/2002/07/owl#> PREFIX dbo: <http://dbpedia.org/ontology/> SELECT ?Wikidata_id ?SoccerPlayerLabel ?Team ?TeamLabel ?numMatches ?numGoals ?startTime ?article ?wikipedia_id WHERE { ?Wikidata_id wdt:P106 wd
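
One way to sidestep the federated-query timeout is to split the work: ask Wikidata's own API for the item's English sitelink, then resolve that title to a numeric page id on Wikipedia. A minimal sketch, assuming the Wikidata item id is already in hand (Q615, Lionel Messi, is just a placeholder):

```python
import requests

def wikipedia_pageid(wikidata_id, lang="en"):
    # Step 1: wbgetentities returns the sitelinks recorded on the item.
    r = requests.get("https://www.wikidata.org/w/api.php", params={
        "action": "wbgetentities",
        "ids": wikidata_id,
        "props": "sitelinks",
        "format": "json",
    }).json()
    title = r["entities"][wikidata_id]["sitelinks"][f"{lang}wiki"]["title"]

    # Step 2: resolve that title to a numeric page id on Wikipedia itself.
    r = requests.get(f"https://{lang}.wikipedia.org/w/api.php", params={
        "action": "query",
        "titles": title,
        "format": "json",
    }).json()
    return next(iter(r["query"]["pages"]))  # pages dict is keyed by page id

print(wikipedia_pageid("Q615"))
```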

'HTMLParseError' when importing 'wikipedia' module in Python3 [duplicate]

六月ゝ 毕业季﹏ submitted on 2020-03-05 10:47:24
Question: This question already has answers here: Importing bs4 in Python 3.5 (3 answers). Closed 4 years ago. I installed the 'wikipedia' module on my Windows 7 machine with pip install wikipedia, but when I run this simple script: import wikipedia print(wikipedia.summary("Wikipedia")) I get an error that says ImportError: cannot import name 'HTMLParseError'. I'm using Python version 3.5 and the latest version of the wikipedia module. Is there another library that will give me this function? Answer 1:
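
The excerpt cuts off before the answer, but the error itself is well understood: HTMLParseError was removed from Python's html.parser in 3.5, and old beautifulsoup4 releases, which the wikipedia package depends on, still try to import it. Upgrading beautifulsoup4 is the usual fix; a quick sanity check afterwards:

```python
# Fix (run in a shell): pip install --upgrade beautifulsoup4
import bs4
import wikipedia

print(bs4.__version__)                 # recent bs4 no longer imports HTMLParseError
print(wikipedia.summary("Wikipedia"))  # should print the article summary
```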

enWiki dump python function

最后都变了- submitted on 2020-02-08 02:31:32
Question: I am looking to create a function that goes through the XML file of articles and then, for each article: if it contains the keyword moral or ethic (wildcard search), move it to another folder; else, ignore it. I have tried a few things and had a look around but am really struggling (I'm not even sure you can do wildcard searching), as I have only just started using Python; any help would be much appreciated. Here is an example of the XML below... <page> <title>Anarchism</title> <ns>0</ns> <id>12</id>
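
A minimal sketch of one interpretation, assuming a single MediaWiki XML dump file: stream <page> elements with iterparse, treat moral*/ethic* as the wildcard match via regex, and copy matching pages into a separate output file (a literal per-article folder move only makes sense if each article is its own file). File names are placeholders.

```python
import re
import xml.etree.ElementTree as ET

KEYWORDS = re.compile(r"\b(moral|ethic)\w*", re.IGNORECASE)

with open("matching_pages.xml", "w", encoding="utf-8") as out:
    for event, elem in ET.iterparse("enwiki-articles.xml", events=("end",)):
        if elem.tag.endswith("page"):        # dump tags carry an XML namespace
            text = "".join(elem.itertext())  # title plus revision text
            if KEYWORDS.search(text):
                # writes raw <page> fragments, with no surrounding root element
                out.write(ET.tostring(elem, encoding="unicode"))
            elem.clear()                     # free memory as we stream
print("done")
```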

Read Wikipedia piped links

我只是一个虾纸丫 submitted on 2020-01-17 04:55:06
Question: I'm using Java and I want to read piped links from Wikipedia that have a specific surface form. For example, in the form [America|US], the surface form is "US" and the internal link is "America". The straightforward solution is to read the XML dump of Wikipedia and find the strings that match the regular expression for a piped link. However, I am afraid that I wouldn't cover all the possible forms of a piped link. I searched and I couldn't find any library that specifically give
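
The question is about Java, but the regex itself carries over directly; a small Python sketch, noting that in real wikitext piped links use double brackets, [[America|US]]:

```python
import re

# Capture target and surface form; exclude nested brackets and extra pipes.
PIPED_LINK = re.compile(r"\[\[([^\[\]|]+)\|([^\[\]|]+)\]\]")

wikitext = "The [[United States|US]] borders [[Canada]]."
for target, surface in PIPED_LINK.findall(wikitext):
    print(f"surface={surface!r} -> target={target!r}")

# Caveat: this misses links with multiple pipes (image options, sort keys)
# and links built by templates; a real wikitext parser such as
# mwparserfromhell is more robust if full coverage matters.
```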

scraping data from wikipedia table

我怕爱的太早我们不能终老 submitted on 2020-01-16 16:32:49
Question: I'm trying to scrape data from a Wikipedia table into a pandas DataFrame. I need to reproduce the three columns: "Postcode, Borough, Neighbourhood". import requests website_url = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text from bs4 import BeautifulSoup soup = BeautifulSoup(website_url,'xml') print(soup.prettify()) My_table = soup.find('table',{'class':'wikitable sortable'}) My_table links = My_table.findAll('a') links Neighbourhood = [] for link in
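
A simpler route than collecting the <a> tags by hand: pandas can parse the table straight from the page with read_html (it needs lxml or html5lib installed). The column handling below assumes the postal-codes table is the first one on the page and has exactly the three columns named in the question; treat both as assumptions.

```python
import pandas as pd

url = "https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M"
tables = pd.read_html(url)      # every <table> on the page as a DataFrame
df = tables[0]                  # assumed: the postcode table comes first
df.columns = ["Postcode", "Borough", "Neighbourhood"]
print(df.head())
```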

extract loosely structured wikipedia text (html)

亡梦爱人 submitted on 2020-01-16 04:26:28
Question: Some of the HTML on Wikipedia disambiguation pages is, shall we say, ambiguous; i.e., the links there that connect to specific persons named Corzine are difficult to capture using jsoup because they're not explicitly structured, nor do they live in a particular section, as in this example. See the Corzine page here. How can I get hold of them? Is jsoup a suitable tool for this task? Perhaps I should use regex, but I fear doing that because I want it to be generalizable. </b> may refer to
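
jsoup can work, but a sturdier alternative to parsing loose HTML is to ask the API for the links on the disambiguation page itself; the structure problem then disappears. A hedged sketch (in Python here, though the same query works from Java):

```python
import requests

r = requests.get("https://en.wikipedia.org/w/api.php", params={
    "action": "query",
    "titles": "Corzine",
    "prop": "links",
    "pllimit": "max",        # all links on the page, not just the first few
    "plnamespace": 0,        # article namespace only
    "format": "json",
}).json()

page = next(iter(r["query"]["pages"].values()))
for link in page.get("links", []):
    print(link["title"])     # e.g. the specific persons named Corzine
```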

How to get information from movies Wikipedia category by API?

时光怂恿深爱的人放手 submitted on 2020-01-11 11:25:13
Question: Is it possible to fetch information from the Wikipedia API by movie category? E.g., I have a URL which searches for avatar, but I don't know how to search for the avatar movie. https://en.wikipedia.org/w/api.php?&titles=avatar&format=xml&action=query&prop=extracts|categories|categoryinfo|pageterms|pageprops|pageimages&exintro=&explaintext=&cllimit=max&piprop=original Answer 1: It will not be easy by "movies category" because there are a lot of nested categories, but you can use something else - all articles about movie
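
Following the direction the (truncated) answer points in: rather than walking nested film categories, restrict a search to pages carrying the film infobox. hastemplate: is a CirrusSearch keyword on Wikipedia's search API; treat the exact filter string as an assumption.

```python
import requests

r = requests.get("https://en.wikipedia.org/w/api.php", params={
    "action": "query",
    "list": "search",
    "srsearch": 'avatar hastemplate:"Infobox film"',  # film articles only
    "format": "json",
}).json()

for hit in r["query"]["search"]:
    print(hit["title"])      # Avatar (2009 film) should rank near the top
```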