Scraping data from a Wikipedia table

庸人自扰 2020-12-18 17:24

I'm trying to scrape data from a Wikipedia table into a pandas DataFrame.

I need to reproduce the three columns: "Postcode", "Borough", "Neighbourhood".



        
3 Answers
  •  孤城傲影
    2020-12-18 18:07

    You need to iterate over each row in the table and store the data row by row, not just in one giant list. Try something like this:

    import pandas
    import requests
    from bs4 import BeautifulSoup

    # Download the page and parse it as HTML (the page is HTML, not XML)
    website_text = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
    soup = BeautifulSoup(website_text, 'html.parser')
    
    # Locate the table, then collect all of its rows
    table = soup.find('table', {'class': 'wikitable sortable'})
    table_rows = table.find_all('tr')
    
    # Build one list of cell values per table row
    data = []
    for row in table_rows:
        data.append([t.text.strip() for t in row.find_all('td')])
    
    df = pandas.DataFrame(data, columns=['PostalCode', 'Borough', 'Neighbourhood'])
    df = df[~df['PostalCode'].isnull()]  # drop rows without <td> cells, such as the header row
    

    Then:

    >>> df.head()
    
      PostalCode           Borough     Neighbourhood
    1        M1A      Not assigned      Not assigned
    2        M2A      Not assigned      Not assigned
    3        M3A        North York         Parkwoods
    4        M4A        North York  Victoria Village
    5        M5A  Downtown Toronto      Harbourfront
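
To see the row-by-row idea without any network access, here is a minimal sketch that uses only the standard library's `html.parser`. It is not the code from the answer above: the `SAMPLE_HTML` markup, the `RowCollector` class, and the `parse_rows` helper are hypothetical names made up for illustration, and the sample table only imitates the shape of the Wikipedia one.

```python
from html.parser import HTMLParser

# Hypothetical sample markup standing in for the Wikipedia table.
SAMPLE_HTML = """
<table class="wikitable sortable">
<tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
<tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
<tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
</table>
"""


class RowCollector(HTMLParser):
    """Collect the text of <td> cells, grouped by the <tr> they belong to."""

    def __init__(self):
        super().__init__()
        self.rows = []          # one list of cell strings per <tr>
        self._in_cell = False   # True while inside a <td> element

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])        # start a new row
        elif tag == "td" and self.rows:
            self._in_cell = True
            self.rows[-1].append("")    # start a new cell in the current row

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data   # append text to the current cell


def parse_rows(html):
    """Return the table as a list of rows, each a list of stripped cell strings."""
    parser = RowCollector()
    parser.feed(html)
    parser.close()
    return [[cell.strip() for cell in row] for row in parser.rows]


rows = parse_rows(SAMPLE_HTML)                  # row-by-row structure
flat = [cell for row in rows for cell in row]   # the "one giant list" shape
```

In this sketch `rows` keeps each record together, while `flat` loses the row boundaries. The header row comes out as an empty list because it uses `<th>` rather than `<td>` cells, which is the kind of row the `isnull()` filter in the answer removes.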
    
