Scraping Wikipedia information (table)

Submitted by 五迷三道 on 2020-12-16 03:58:27

Question


I need to scrape the information under Elenco dei comuni per regione (list of comuni by region) on Wikipedia. I would like to build a data structure that associates each comune with its corresponding region, i.e. something like this:

'Abbateggio': 'Pescara' -> Abruzzo
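The target mapping could be sketched as a nested dictionary with a lookup helper (the data below is illustrative, not scraped):

```python
# Sketch of the desired structure: region -> province -> list of comuni.
# The entries here are hypothetical sample data for illustration.
comuni_by_region = {
    "Abruzzo": {
        "Pescara": ["Abbateggio", "Alanno"],
    },
}

def region_of(comune):
    """Return (province, region) for a comune, or None if unknown."""
    for region, provinces in comuni_by_region.items():
        for province, comuni in provinces.items():
            if comune in comuni:
                return province, region
    return None

print(region_of("Abbateggio"))  # ('Pescara', 'Abruzzo')
```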

I tried to get information using BeautifulSoup and requests as follows:

from bs4 import BeautifulSoup as bs
import requests

with requests.Session() as s:  # use session object for efficiency of tcp re-use
    s.headers = {'User-Agent': 'Mozilla/5.0'}
    r = s.get('https://it.wikipedia.org/wiki/Comuni_d%27Italia')
    soup = bs(r.text, 'html.parser')
    for ele in soup.find_all('h3')[:6]:
        tx = ele.find('span', attrs={'class': 'mw-headline'})
        if tx is not None:
            print(tx['id'])

however, it does not work (it returns an empty list). The information I looked at with Google Chrome's Inspect tool is the following:

<span class="mw-headline" id="Elenco_dei_comuni_per_regione">Elenco dei comuni per regione</span> (table)

<a href="/wiki/Comuni_dell%27Abruzzo" title="Comuni dell'Abruzzo">Comuni dell'Abruzzo</a> 

(this field changes for each region)

then <table class="wikitable sortable query-tablesortes">

Could you please advise me on how to get such results? Any help and suggestions will be appreciated.

EDIT:

Example:

I have a word: comunediabbateggio. This word includes Abbateggio. I would like to know which region can be associated with that city, if it exists. The information from Wikipedia is needed to build a dataset that lets me check the field and associate a region with each comune/city. What I should expect is:

WORD                         REGION/STATE
comunediabbateggio           Pescara

I hope this helps. Sorry if it was not clear. Another example, for English speakers, that might be easier to understand is the following:

Instead of the Italian link above, you can also consider the following: https://en.wikipedia.org/wiki/List_of_comuni_of_Italy . For each region (Lombardia, Veneto, Sicily, ...) I need to collect the list of comuni of each province. If you click a "List of communes of ..." link, there is a table that lists the comuni, e.g. https://en.wikipedia.org/wiki/List_of_communes_of_the_Province_of_Agrigento.
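Once such a comune-to-region mapping exists, the matching step described above (finding Abbateggio inside comunediabbateggio) could be sketched as a case-insensitive substring test. The mapping below is a hypothetical one-entry sample:

```python
# Illustrative mapping; in practice this would come from the scraped tables.
comune_to_region = {"Abbateggio": "Pescara"}

def match_word(word, mapping):
    """Find a known comune name contained in `word`; return (comune, region) or None."""
    word = word.lower().replace(" ", "")
    hits = [c for c in mapping if c.lower().replace(" ", "") in word]
    if not hits:
        return None
    # Prefer the longest match so short names don't win spuriously.
    best = max(hits, key=len)
    return best, mapping[best]

print(match_word("comunediabbateggio", comune_to_region))  # ('Abbateggio', 'Pescara')
```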


Answer 1:


import re
import requests
from bs4 import BeautifulSoup
import pandas as pd
from tqdm import tqdm

target = "https://en.wikipedia.org/wiki/List_of_comuni_of_Italy"


def main(url):
    with requests.Session() as req:
        r = req.get(url)
        soup = BeautifulSoup(r.content, 'html.parser')

        # TOC numbers matching "x.y" mark the per-region subsections;
        # the span right after each holds the section title.
        provinces = [item.find_next("span").text for item in soup.findAll(
            "span", class_="tocnumber", text=re.compile(r"\d[.]\d"))]

        # Section ids use underscores instead of spaces.
        search = [item.replace(" ", "_") for item in provinces]

        # For each section, collect the linked "List of communes of ..."
        # entries, keeping only the part after "of ".
        nested = []
        for item in search:
            for a in soup.findAll("span", id=item):
                goes = [b.text.split("of ")[-1]
                        for b in a.find_next("ul").findAll("a")]
                nested.append(goes)

        dictionary = dict(zip(provinces, nested))

        # url[:24] == "https://en.wikipedia.org", the site root needed to
        # absolutize the relative hrefs from each section's link list.
        urls = [f'{url[:24]}{b.get("href")}' for item in search for a in soup.findAll(
            "span", id=item) for b in a.find_next("ul").findAll("a")]
    return urls, dictionary


def parser():
    links, dics = main(target)
    com = []
    for link in tqdm(links):
        try:
            # The second column of the first table on each page holds the
            # comune names; drop the trailing totals row.
            df = pd.read_html(link)[0]
            com.append(df[df.columns[1]].to_list()[:-1])
        except ValueError:  # page has no parseable table
            com.append(["N/A"])
    com = iter(com)
    for x in dics:
        dics[x] = dict(zip(dics[x], com))
    print(dics)


parser()
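The dictionary printed above is nested (section -> province list -> comune list), while the question asks for a flat comune-to-region lookup. Assuming the output has the shape `{section: {province: [comuni]}}`, it could be inverted with a sketch like this (`sample` is illustrative stand-in data, not real scraper output):

```python
def flatten(dics):
    """Invert {section: {province: [comuni]}} into comune -> (province, section)."""
    lookup = {}
    for section, provinces in dics.items():
        for province, comuni in provinces.items():
            for comune in comuni:
                lookup[comune] = (province, section)
    return lookup

# Illustrative data in the same shape as the scraper's output:
sample = {"Abruzzo": {"Pescara": ["Abbateggio", "Alanno"]}}
print(flatten(sample)["Abbateggio"])  # ('Pescara', 'Abruzzo')
```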


Source: https://stackoverflow.com/questions/61021901/scraping-wikipedia-information-table
