
Wikipedia Data Scraping with Python

爷,独闯天下 submitted on 2019-12-05 20:52:41
I am trying to retrieve three columns (NFL team, player name, college team) from the following Wikipedia page. I am new to Python and have been trying to use BeautifulSoup to get this done. I only need the rows that belong to QBs, but I haven't even been able to get all the columns regardless of position. This is what I have so far; it outputs nothing and I'm not entirely sure why. I believe the problem is the a tags, but I do not know what to change. Any help would be greatly appreciated. wiki = "http://en.wikipedia.org/wiki/2008_NFL_draft" header = {'User-Agent': 'Mozilla/5.0'} #Needed to prevent
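A minimal sketch (not the asker's code) of one way to pull those columns with requests and BeautifulSoup; the table class and the column order of the draft table are assumptions that may need adjusting against the live page:

    import requests
    from bs4 import BeautifulSoup

    url = "https://en.wikipedia.org/wiki/2008_NFL_draft"
    resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
    soup = BeautifulSoup(resp.text, "html.parser")

    # Assumption: the first "wikitable" on the page is the draft selections table.
    table = soup.find("table", {"class": "wikitable"})
    for row in table.find_all("tr")[1:]:
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        # Assumed column layout: round, pick, NFL team, player, position, college, ...
        if len(cells) >= 6 and cells[4] == "QB":
            nfl_team, player, college = cells[2], cells[3], cells[5]
            print(nfl_team, player, college)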

How to get coordinates from a Wikipedia page through API?

北战南征 submitted on 2019-12-05 17:51:45
I want to get the coordinates of a Wikipedia page through their API, passing the page title as the 'titles' parameter. I have searched SO for a solution, but the answers seem to scrape the page and then extract the coordinates. Is it possible through their API? You need to use the Wikipedia API. For your example with Kinkaku-ji the query would be: https://en.wikipedia.org/w/api.php?action=query&prop=coordinates&titles=Kinkaku-ji For more than one title, use a pipe to separate them: titles=Kinkaku-ji|Paris|... Source: https://stackoverflow.com/questions/40098656/how-to-get-coordinates-from-a-wikipedia-page-through-api
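A sketch of issuing that query from Python and reading the latitude/longitude out of the JSON (format=json is added for machine parsing; pages lacking coordinates simply have no 'coordinates' key):

    import requests

    params = {
        "action": "query",
        "prop": "coordinates",
        "titles": "Kinkaku-ji|Paris",
        "format": "json",
    }
    resp = requests.get("https://en.wikipedia.org/w/api.php", params=params).json()
    for page in resp["query"]["pages"].values():
        for coord in page.get("coordinates", []):
            print(page["title"], coord["lat"], coord["lon"])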

Wikipedia API: search for famous people

可紊 submitted on 2019-12-05 16:03:59
Question: I have the following Wikipedia API search query: http://en.wikipedia.org/w/api.php?&action=query&generator=search&gsrnamespace=0&gsrlimit=20&prop=pageimages|extracts&pilimit=max&exintro&exsentences=1&exlimit=max&continue&pithumbsize=100&gsrsearch=Albert%20Einstein I just want to list famous people - is there a way to do that? Answer 1: There isn't an exact way to limit your search results to only famous people. However, you can use a few different filters with Wikipedia's CirrusSearch to
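One such filter, sketched below: adding hastemplate:"Infobox person" to the search string restricts hits to pages carrying that template, a rough heuristic for biographies, since "famous" itself has no CirrusSearch keyword. The parameter set is trimmed from the asker's query:

    import requests

    params = {
        "action": "query",
        "generator": "search",
        "gsrnamespace": 0,
        "gsrlimit": 20,
        "gsrsearch": 'Albert Einstein hastemplate:"Infobox person"',
        "prop": "extracts",
        "exintro": 1,
        "exsentences": 1,
        "exlimit": "max",
        "format": "json",
    }
    resp = requests.get("https://en.wikipedia.org/w/api.php", params=params).json()
    for page in resp.get("query", {}).get("pages", {}).values():
        print(page["title"])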

MySQL huge SQL file insertion | MyISAM speed suddenly slows down for insertions (strange issue)

孤者浪人 submitted on 2019-12-05 07:35:23
Question: I'm facing a very strange problem. I asked a question here about speeding up insertion in MySQL, specifically the insertion of huge SQL files, multiple GB in size. The suggestion was to use the MyISAM engine. I did the following: ALTER TABLE revision ENGINE=MyISAM; used ALTER TABLE .. DISABLE KEYS (MyISAM only); set bulk_insert_buffer_size to 500M (MyISAM only); set unique_checks = 0 (not checked); SET autocommit=0; ... SQL import statements ... COMMIT; SET foreign_key_checks=0; It sped up
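A consolidated sketch of those settings around a bulk load, here driven from Python with the mysql-connector-python package (the driver choice and connection details are assumptions; the table name is the question's):

    import mysql.connector

    conn = mysql.connector.connect(user="root", password="...", database="wiki")
    cur = conn.cursor()
    for stmt in (
        "ALTER TABLE revision ENGINE=MyISAM",
        "ALTER TABLE revision DISABLE KEYS",        # MyISAM only
        "SET bulk_insert_buffer_size = 524288000",  # ~500M, session scope
        "SET unique_checks = 0",
        "SET foreign_key_checks = 0",
        "SET autocommit = 0",
    ):
        cur.execute(stmt)
    # ... run the SQL import statements here ...
    cur.execute("COMMIT")
    cur.execute("ALTER TABLE revision ENABLE KEYS")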

Retrieve a list of all Wikipedia languages programmatically

徘徊边缘 submitted on 2019-12-05 06:59:30
I need to retrieve a list of all existing languages for a certain wiki project - for example, all Wikivoyage or all Wikipedia languages, just like on their landing pages. I would prefer to do this via the MediaWiki API, if possible. Thanks for your time. Approach 3: Using an API on a wiki in the Wikimedia wiki farm and Extension:Sitematrix https://commons.wikimedia.org/w/api.php?action=sitematrix&smtype=language While this will return all wikis the matrix knows about, it is easily filtered client-side by site code [as of now, one of: wiki (Wikipedia), wiktionary, wikibooks, wikinews, wikiquote, wikisource,
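A sketch of calling that Sitematrix query from Python and filtering client-side, here keeping only Wikivoyage sites; the JSON layout (numeric keys per language, each carrying a "site" list, plus "count" and "specials" entries) follows the API's output:

    import requests

    resp = requests.get(
        "https://commons.wikimedia.org/w/api.php",
        params={"action": "sitematrix", "smtype": "language", "format": "json"},
    ).json()
    for key, lang in resp["sitematrix"].items():
        if not key.isdigit():   # skip the "count" and "specials" entries
            continue
        for site in lang.get("site", []):
            if site.get("code") == "wikivoyage":
                print(lang["code"], site["url"])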

Generating plain text from a Wikipedia database dump

泪湿孤枕 submitted on 2019-12-05 06:16:38
I found a Python script (here: Wikipedia Extractor) that can generate plain text from an (English) Wikipedia database dump. When I use this command (as stated on the script's page): $ python enwiki-latest-pages-articles.xml WikiExtractor.py -b 500K -o extracted I get this error: File "enwiki-latest-pages-articles.xml", line 1 <mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.8/ http://www.mediawiki.org/xml/export-0.8.xsd" version="0.8" xml:lang="en"> ^ SyntaxError:
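Judging from the traceback, Python is executing the XML dump itself as a script, which suggests the two positional arguments are swapped: the script name must come first. A likely corrected invocation (same flags the asker quotes):

    $ python WikiExtractor.py -b 500K -o extracted enwiki-latest-pages-articles.xml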

How to build the Wikipedia category hierarchy?

孤者浪人 submitted on 2019-12-05 06:09:47
I'm trying to build the tree graph of Wikipedia articles and their categories. What do I need to do that? From this site (http://dumps.wikimedia.org/enwiki/latest/) I've downloaded: enwiki-latest-page.sql.gz enwiki-latest-categorylinks.sql.gz enwiki-20141106-category.sql.gz I tried following the answer here (Wikipedia Category Hierarchy from dumps), but categorylinks doesn't seem to have the same schema (no pageId column). What's the right way to build the hierarchy? Bonus question: how can I tell which of the 35M pages in enwiki-latest-page.sql.gz are articles (supposedly about 5M
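One schema point worth checking first: in current dumps, categorylinks has no column literally named pageId, but cl_from holds the source page ID and cl_to the parent category title, so the hierarchy comes from joining against the page table. A sketch, once both dumps are loaded into MySQL (connection details are placeholders):

    import mysql.connector

    conn = mysql.connector.connect(user="root", password="...", database="enwiki")
    cur = conn.cursor()
    cur.execute("""
        SELECT p.page_title AS child, cl.cl_to AS parent
        FROM categorylinks cl
        JOIN page p ON p.page_id = cl.cl_from
        WHERE p.page_namespace = 14  -- 14 = Category, i.e. category-to-category edges
    """)
    for child, parent in cur:
        print(parent, "->", child)
    # For the bonus question: articles are roughly the rows with
    # page_namespace = 0 AND page_is_redirect = 0 in the page table.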

Auto-hyperlink every occurrence of a certain word (or word sequence) to a predefined URL (non-ambiguous), but don't show the full URL

自闭症网瘾萝莉.ら submitted on 2019-12-05 05:58:04
Question: Similar to: Search For Words, Replace With Links. However, I would rather not have the full hyperlink URL visible; only the link text should appear to the end user. --- I also can't figure out how to use the JavaScript replace() function, as used in the following post: How to replace all occurrences of a string in JavaScript?, for this specific issue. --- Also similar to a JS question: Link terms on page to Wikipedia articles in pure JavaScript, but I guess the
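The core transformation, sketched here in Python for brevity (the question is about JavaScript, but the logic carries over): regex-replace whole-word occurrences with an anchor tag, so the end user sees only the link text while the URL sits in the href. The word-to-URL map below is invented for illustration:

    import re

    links = {"Einstein": "https://en.wikipedia.org/wiki/Albert_Einstein"}
    text = "Einstein developed relativity."
    for word, url in links.items():
        # \b anchors keep the word from matching inside longer words
        pattern = r"\b" + re.escape(word) + r"\b"
        text = re.sub(pattern, '<a href="{}">{}</a>'.format(url, word), text)
    print(text)  # the word is now a hyperlink whose URL appears only in the href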

Why does Wikipedia return a 301 response code for certain URLs?

安稳与你 submitted on 2019-12-05 04:27:21
Question: Some requests with special characters, in this case French accents, misbehave. var client = new HttpClient(); var data0 = await client.GetAsync("http://fr.wikipedia.org/wiki/Monastère_d'Arkadi"); This simple code yields: StatusCode: 301, ReasonPhrase: 'Moved Permanently', Version: 1.1 Any ideas what is happening? Downloading the French article on New York does work, in fact. I've even tried to encode the name, but nothing works. Answer 1: Wikipedia sends an HTTP 301 indicating that the permanent home of http:/
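A sketch of the same request in Python (the asker's code is C#): issued without following redirects it surfaces the 301 and its Location header, and issued normally it resolves to the article, since requests follows redirects by default:

    import requests

    url = "http://fr.wikipedia.org/wiki/Monastère_d'Arkadi"
    r = requests.get(url, allow_redirects=False)
    print(r.status_code, r.headers.get("Location"))  # 301 plus the canonical URL

    r = requests.get(url)  # redirects followed automatically
    print(r.status_code)   # 200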

How to get the image from the first page when searching in Google?

若如初见. submitted on 2019-12-05 02:36:51
Question: Usually, after using Google to search for a city, there is a portion of a Wikipedia page on the right with an image and a map. Can anyone tell me how I can access this image? I need to know how to download it. Answer 1: Actually, the main image (the one that goes with the map image on the right) is very rarely from Wikipedia, so you can't use the Wikipedia API to get it. If you want to access the actual main image you can use this: private static void GetGoogleImage(string word) { // make an HTTP Get request var
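For the minority of cases where the panel image does come from Wikipedia, a sketch of asking the API for a page's lead image via prop=pageimages; this does not reproduce Google's own image choice, which, as the answer notes, usually isn't Wikipedia's:

    import requests

    params = {
        "action": "query",
        "prop": "pageimages",
        "piprop": "original",
        "titles": "Paris",   # example city; substitute the search term
        "format": "json",
    }
    resp = requests.get("https://en.wikipedia.org/w/api.php", params=params).json()
    for page in resp["query"]["pages"].values():
        print(page.get("original", {}).get("source"))  # direct image URL, if any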