Scraping data from a Wikipedia table

庸人自扰 2020-12-18 17:24

I'm just trying to scrape data from a Wikipedia table into a pandas DataFrame.

I need to reproduce the three columns: "Postcode, Borough, Neighbourhood".



        
3 answers
  • 2020-12-18 17:44

    Basedig provides a platform for downloading Wikipedia tables directly as Excel, CSV, or JSON files. Here is a link to its Wikipedia source: https://www.basedig.com/wikipedia/

    If you do not find the dataset you are looking for on Basedig, send them the link to your article and they'll parse it for you. Hope this helps.

  • 2020-12-18 18:05

    You may be overthinking the problem if you only want the script to pull one table from the page. One import, one line, no loops:

    import pandas as pd

    url = 'https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'

    # read_html returns a list of DataFrames, one per table on the page; take the first
    df = pd.read_html(url, header=0)[0]

    df.head()
    
        Postcode    Borough             Neighbourhood
    0   M1A         Not assigned        Not assigned
    1   M2A         Not assigned        Not assigned
    2   M3A         North York          Parkwoods
    3   M4A         North York          Victoria Village
    4   M5A         Downtown Toronto    Harbourfront
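
Taking the table at index 0 only works while the target stays the first table on the page. A minimal sketch of a more selective variant, assuming the wanted table can be recognised by its "Postcode" header text: `pd.read_html` accepts a `match` argument that keeps only tables containing that text. The inline HTML and the `read_postcode_table` helper are made up for illustration so the example runs offline, and the sketch assumes `lxml` (or `bs4` plus `html5lib`) is installed, which `read_html` needs.

```python
import io

import pandas as pd

# Hypothetical stand-in for the page: two tables, only the second is wanted.
SAMPLE_HTML = """
<table><tr><th>Note</th></tr><tr><td>unrelated table</td></tr></table>
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M4A</td><td>North York</td><td>Victoria Village</td></tr>
</table>
"""


def read_postcode_table(html: str) -> pd.DataFrame:
    """Return the first table whose text matches 'Postcode' (illustrative helper)."""
    # match= filters the returned list down to tables containing the given text.
    tables = pd.read_html(io.StringIO(html), match="Postcode", header=0)
    # Keep only the three columns the question asks for.
    return tables[0][["Postcode", "Borough", "Neighbourhood"]]


df = read_postcode_table(SAMPLE_HTML)
print(df)
```

To run it against the live page, the same call should accept the URL in place of the `io.StringIO` wrapper, provided the page still uses a "Postcode" header.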
    
  • 2020-12-18 18:07

    You need to iterate over each row in the table and store the data row by row, not just in one giant list. Try something like this:

    import pandas
    import requests
    from bs4 import BeautifulSoup

    website_text = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
    # The page is HTML, so parse it with an HTML parser rather than 'xml'
    soup = BeautifulSoup(website_text, 'html.parser')

    table = soup.find('table', {'class': 'wikitable sortable'})
    table_rows = table.find_all('tr')

    # Collect one list of cell texts per table row
    data = []
    for row in table_rows:
        data.append([t.text.strip() for t in row.find_all('td')])

    df = pandas.DataFrame(data, columns=['PostalCode', 'Borough', 'Neighbourhood'])
    # The header row has no <td> cells, so it becomes an all-null row; drop it
    df = df[~df['PostalCode'].isnull()]
    

    Then:

    >>> df.head()
    
      PostalCode           Borough     Neighbourhood
    1        M1A      Not assigned      Not assigned
    2        M2A      Not assigned      Not assigned
    3        M3A        North York         Parkwoods
    4        M4A        North York  Victoria Village
    5        M5A  Downtown Toronto      Harbourfront
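
The row-by-row approach can also take the column names from the table's own header cells instead of hardcoding them. A minimal sketch, assuming the first row holds the `<th>` header cells and every later row holds `<td>` data cells; the inline HTML and the `table_to_frame` helper are hypothetical, used only so the example runs without a network request.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Wikipedia table markup.
SAMPLE_HTML = """
<table class="wikitable sortable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M5A</td><td>Downtown Toronto</td><td>Harbourfront</td></tr>
</table>
"""


def table_to_frame(html: str) -> pd.DataFrame:
    """Build a DataFrame from the first 'wikitable' in the HTML (illustrative helper)."""
    soup = BeautifulSoup(html, "html.parser")
    # class_ matches any table that carries the 'wikitable' class among its classes.
    table = soup.find("table", class_="wikitable")
    rows = table.find_all("tr")
    # Column names come from the <th> cells of the first row.
    columns = [th.get_text(strip=True) for th in rows[0].find_all("th")]
    # One list of cell texts per data row; rows without <td> cells are skipped.
    data = [
        [td.get_text(strip=True) for td in row.find_all("td")]
        for row in rows[1:]
        if row.find_all("td")
    ]
    return pd.DataFrame(data, columns=columns)


frame = table_to_frame(SAMPLE_HTML)
print(frame)
```

Skipping rows without `<td>` cells up front means no null-row filter is needed afterwards, at the cost of assuming the header really is the first row.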
    