Extract text from Wikipedia html using Python

Posted by 亡梦爱人 on 2019-12-11 11:34:47

Question


I am looking for a way to extract the main text of a Wikipedia article using Python. I am aware of the "wikipedia" library, but in my case I have already downloaded the HTML page and just need to extract the text from it. I can't use that library because I need to work with Wikipedia page HTML that was downloaded some years ago, so I can't download it again from scratch.

Is there an "off the shelf" solution that I can use for this purpose?


Answer 1:


Try BeautifulSoup:

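from bs4 import BeautifulSoup
import requests

# Fetch the page (skip this step if the HTML is already saved locally)
response = requests.get("http://pl.wikipedia.org/wiki/StackOverflow")
# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(response.text, "html.parser")
paragraphs = soup.find_all('p')
print(paragraphs[0].text)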
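
Since the question is about HTML that is already on disk, the same idea can be applied without any network request. The following is only a minimal sketch, not a definitive solution. It assumes the article body sits inside a div with the id "mw-content-text", which may not hold for very old page dumps, so it falls back to the whole document when that container is missing. The function names extract_paragraphs and extract_from_file and the file name in the usage note are hypothetical.

```python
from bs4 import BeautifulSoup


def extract_paragraphs(html):
    """Return the non-empty paragraph texts found in a Wikipedia HTML page."""
    soup = BeautifulSoup(html, "html.parser")
    # Assumption: the article body lives in <div id="mw-content-text">.
    content = soup.find("div", id="mw-content-text")
    if content is None:
        # Container not found (e.g. an older page layout): search everything.
        content = soup
    paragraphs = []
    for p in content.find_all("p"):
        text = p.get_text().strip()
        if text:  # skip empty paragraphs
            paragraphs.append(text)
    return paragraphs


def extract_from_file(path, encoding="utf-8"):
    """Read a saved HTML file and return its paragraphs as one string."""
    with open(path, encoding=encoding) as f:
        return "\n\n".join(extract_paragraphs(f.read()))
```

For a saved page this would be called as extract_from_file("saved_article.html"). Footnote markers such as [1] are left in the text and would need extra cleanup if they are unwanted.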



Answer 2:


You can use this Python module:

pip install wikipedia


Source: https://stackoverflow.com/questions/26284526/extract-text-from-wikipedia-html-using-python
