How to get Wikipedia content as text by API?

Posted by 徘徊边缘 on 2021-02-08 08:52:11

Question


I want to get Wikipedia pages as text.

I looked at the Wikipedia API help at https://en.wikipedia.org/w/api.php, which says that in order to get pages as text I need to append this to a page address:

api.php?action=query&meta=siteinfo&siprop=namespaces&format=txt

However, when I try appending this suffix to a normal page's address, the page is not found:

https://en.wikipedia.org/wiki/George_Washington/api.php?action=query&meta=siteinfo&siprop=namespaces&format=txt

Following the instructions in "Get Text Content from mediawiki page via API", I tried appending /api.php?action=parse&page=test to the page address instead, which gave me this:

https://en.wikipedia.org/wiki/George_Washington/api.php?action=parse&page=test

However, this doesn't work either.


Answer 1:


NB: All these examples are CORS-enabled (via origin=*).


Get the text in JSON format from the exact title (as it appears in the Wikipedia page URL):

https://en.wikipedia.org/w/api.php?action=query&origin=*&prop=extracts&explaintext&titles=Sokolsky_Opening&format=json
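The same URL can be assembled from a title instead of being typed by hand. This is only a minimal sketch: buildExtractUrl is a hypothetical helper name, and it assumes an environment where URLSearchParams is available (modern browsers and Node.js).

```javascript
// Sketch: build the plain-text extract URL for a page title.
// buildExtractUrl is a hypothetical helper, not part of any library.
function buildExtractUrl(title) {
  const params = new URLSearchParams({
    action: "query",
    origin: "*",       // permits anonymous cross-origin (CORS) requests
    prop: "extracts",
    explaintext: "",   // boolean flag: being present is what switches it on
    titles: title,
    format: "json",
  });
  return "https://en.wikipedia.org/w/api.php?" + params.toString();
}

console.log(buildExtractUrl("Sokolsky_Opening"));
```

URLSearchParams also takes care of escaping, so a title containing spaces or other special characters does not need to be encoded manually.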


Search for relevant pages by keyword, and get their IDs, exact titles (as used in URLs) and a short text extract:

https://en.wikipedia.org/w/api.php?action=query&prop=extracts&exlimit=max&format=json&exsentences=1&origin=*&exintro=&explaintext=&generator=search&gsrlimit=23&gsrsearch=chess
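The search response groups its results under query.pages, keyed by page ID. A rough sketch of turning that object into an ordered list follows; listSearchHits is a hypothetical helper, the sample data is invented for illustration, and the sketch assumes each hit carries an index field giving its search rank (it falls back to the original order if that field is absent).

```javascript
// Sketch: flatten a generator=search response into a ranked array.
// listSearchHits is a hypothetical helper name.
function listSearchHits(data) {
  const pages = (data.query && data.query.pages) || {};
  return Object.values(pages)
    // "index" is assumed to hold the search rank; missing values sort as 0
    .sort((a, b) => (a.index || 0) - (b.index || 0))
    .map(page => ({
      pageid: page.pageid,
      title: page.title,
      extract: page.extract || "",
    }));
}

// Invented sample data in the shape described above (not real API output).
const sample = {
  query: {
    pages: {
      "10": { pageid: 10, title: "Second hit", index: 2, extract: "B." },
      "20": { pageid: 20, title: "First hit", index: 1, extract: "A." },
    },
  },
};

console.log(listSearchHits(sample).map(hit => hit.title));
```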


Get the wiki page ID by the precise title:

https://en.wikipedia.org/w/api.php?action=query&origin=*&prop=pageprops&format=json&titles=Sokolsky_Opening
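In the JSON returned by this query, the page ID is both the key under query.pages and the pageid field of the entry. A small sketch of reading it is shown below; pageIdFromQuery is a hypothetical helper, and the sample objects are hand-written in the default (formatversion=1) response shape, where a title that does not exist is reported under a negative key with a missing property.

```javascript
// Sketch: read the page ID out of an action=query response.
// pageIdFromQuery is a hypothetical helper name.
function pageIdFromQuery(data) {
  const pages = (data.query && data.query.pages) || {};
  for (const page of Object.values(pages)) {
    // Skip entries flagged as missing (no such title)
    if (!("missing" in page) && page.pageid > 0) {
      return page.pageid;
    }
  }
  return null; // nothing usable found
}

// Hand-written samples, not captured API output.
const found = { query: { pages: { "100017": { pageid: 100017, ns: 0, title: "Sokolsky Opening" } } } };
const notFound = { query: { pages: { "-1": { ns: 0, title: "No such page", missing: "" } } } };

console.log(pageIdFromQuery(found), pageIdFromQuery(notFound));
```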


Get the full HTML by wiki page ID:

https://en.wikipedia.org/w/api.php?action=parse&origin=*&format=json&pageid=100017
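With format=json, the rendered HTML of an action=parse response sits under parse.text, in a "*" key by default. The following is a sketch of pulling it out, assuming that default shape; htmlFromParse is a hypothetical helper, and the sample object is hand-written rather than real output.

```javascript
// Sketch: extract the HTML string from an action=parse response.
// htmlFromParse is a hypothetical helper name.
function htmlFromParse(data) {
  if (data.error) {
    // API errors come back as an "error" object with "code" and "info"
    throw new Error(data.error.info || "API error");
  }
  const text = data.parse.text;
  // Default output nests the HTML under "*"; formatversion=2 returns a plain string
  return typeof text === "string" ? text : text["*"];
}

// Hand-written sample in the default response shape.
const parsed = { parse: { title: "Sokolsky Opening", pageid: 100017, text: { "*": "<div>...</div>" } } };
console.log(htmlFromParse(parsed));
```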


Get stripped HTML, a lighter version without the wikidata:

https://en.wikipedia.org/w/api.php?action=query&origin=*&prop=extracts&format=json&titles=Sokolsky_Opening


Cross origin:

By the way, since these are CORS requests, once we know (or have searched for) the page ID and/or the page title, we can use fetch to embed some wiki text anywhere, in an SSL (HTTPS) context.

If the ID is unknown, we have to loop through the JSON.

fetch("https://en.wikipedia.org/w/api.php?action=query&origin=*&prop=extracts&explaintext&format=json&titles=Sokolsky_Opening")
  .then(response => response.json())
  .then(data => {
    // The pages object is keyed by page ID; 100017 is the ID of Sokolsky_Opening.
    document.getElementById("main").textContent = data.query.pages["100017"].extract;
  });
<pre id="main" style="white-space: pre-wrap"></pre>
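The snippet above hard-codes the page ID 100017. When the ID is not known in advance, one possible way to "loop through the JSON" is sketched here. Both firstExtract and fetchExtract are hypothetical helper names, and the sketch assumes an environment that provides a global fetch (browsers, or a recent Node.js).

```javascript
// Sketch: return the extract of the first page in the response,
// whatever its ID happens to be.
function firstExtract(data) {
  const pages = (data.query && data.query.pages) || {};
  for (const id in pages) {
    if (typeof pages[id].extract === "string") {
      return pages[id].extract;
    }
  }
  return null; // no page with an extract (e.g. the title does not exist)
}

// Fetch a page by title and resolve to its plain-text extract (or null).
async function fetchExtract(title) {
  const url =
    "https://en.wikipedia.org/w/api.php?action=query&origin=*&prop=extracts" +
    "&explaintext&format=json&titles=" + encodeURIComponent(title);
  const response = await fetch(url);
  return firstExtract(await response.json());
}

// Possible usage in a page containing <pre id="main">:
// fetchExtract("Sokolsky_Opening").then(text => {
//   document.getElementById("main").textContent = text || "";
// });
```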

Good luck.




Answer 2:


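You have to use one of these formats: json, jsonfm, none, php, phpfm, rawfm, xml or xmlfm; txt is not a valid format. Your API link is also wrong: the endpoint is /w/api.php, and the page is passed in the titles parameter rather than appended to the article address. Use this: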

https://en.wikipedia.org/w/api.php?action=query&titles=George_Washington&prop=revisions&rvprop=content&format=xml
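To make the difference from the URLs in the question concrete, here is a rough sketch that converts an article address into an API request of the form above. apiUrlFromPageUrl is a hypothetical helper, and it assumes the Wikipedia layout in which articles live under /wiki/ and the API under /w/api.php; other MediaWiki sites may use different paths.

```javascript
// Sketch: turn an article URL into the corresponding API query URL.
// apiUrlFromPageUrl is a hypothetical helper name.
function apiUrlFromPageUrl(pageUrl) {
  const url = new URL(pageUrl);
  const prefix = "/wiki/";
  if (!url.pathname.startsWith(prefix)) {
    throw new Error("Not an article URL: " + pageUrl);
  }
  // Everything after /wiki/ is the page title
  const title = decodeURIComponent(url.pathname.slice(prefix.length));
  const params = new URLSearchParams({
    action: "query",
    titles: title,
    prop: "revisions",
    rvprop: "content",
    format: "xml",
  });
  return url.origin + "/w/api.php?" + params.toString();
}

console.log(apiUrlFromPageUrl("https://en.wikipedia.org/wiki/George_Washington"));
```

Because everything after /wiki/ is read as a page title, appending api.php to an article address asks for a page literally named "George_Washington/api.php", which is why the attempts in the question end in "not found".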


Source: https://stackoverflow.com/questions/33844207/how-to-get-wikipedia-content-as-text-by-api
