Extract the first paragraph from a Wikipedia article (Python)

闹比i 2020-11-28 01:36

How can I extract the first paragraph from a Wikipedia article, using Python?

For example, for Albert Einstein, that would be:

10 answers
  •  小蘑菇 (OP)
    2020-11-28 02:01

    Wikipedia runs a MediaWiki extension that provides exactly this functionality as an API module. TextExtracts implements action=query&prop=extracts with options to return the first N sentences and/or just the introduction, as HTML or plain text.

    Here's the API call you want to make; try it: https://en.wikipedia.org/w/api.php?action=query&prop=extracts&titles=Albert%20Einstein&exintro=&exsentences=2&explaintext=&redirects=&formatversion=2

    • action=query&prop=extracts to request this info
    • exsentences=2, exintro=, and explaintext= are parameters to the module (see the TextExtracts API documentation) asking for two sentences from the intro as plain text; leave off explaintext to get HTML instead.
    • redirects= (true) follows redirects, so if you ask for "titles=Einstein" you'll get the Albert Einstein page info
    • formatversion=2 for a cleaner format in UTF-8.

    There are various libraries that wrap invoking the MediaWiki action API, such as the one in DGund's answer, but it's not too hard to make the API calls yourself.
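As a rough illustration of making the call yourself, here is a minimal sketch using only the Python standard library. The function names (build_extract_url, parse_extract, fetch_intro) and the User-Agent string are made up for this example. It also adds format=json, which the URL above omits, on the assumption that you want machine-readable JSON rather than the HTML-formatted debug output. The parsing assumes the formatversion=2 layout, where query.pages is a list.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"


def build_extract_url(title, sentences=2):
    """Build the TextExtracts query URL described in the answer."""
    params = {
        "action": "query",
        "prop": "extracts",
        "titles": title,
        "exintro": "",             # only the introduction
        "exsentences": sentences,  # first N sentences
        "explaintext": "",         # plain text instead of HTML
        "redirects": "",           # follow redirects (e.g. "Einstein")
        "formatversion": 2,
        "format": "json",          # assumption: not in the original URL
    }
    return API_URL + "?" + urllib.parse.urlencode(params)


def parse_extract(payload):
    """Return the extract from a formatversion=2 response dict, or None."""
    pages = payload.get("query", {}).get("pages", [])
    if not pages or "extract" not in pages[0]:
        return None  # missing page or no extract available
    return pages[0]["extract"]


def fetch_intro(title, sentences=2):
    """Fetch the first sentences of an article's intro (needs network access)."""
    request = urllib.request.Request(
        build_extract_url(title, sentences),
        # Hypothetical User-Agent; replace with one identifying your client.
        headers={"User-Agent": "intro-extract-example/0.1 (illustrative)"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return parse_extract(json.load(response))
```

Calling fetch_intro("Albert Einstein") should then return the opening sentences of the article as a plain string, or None if the page does not exist.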

    The MediaWiki documentation page "Page info in search results" discusses getting this text extract, along with getting a description and lead image for articles.
