Beautiful Soup AssertionError

Submitted by 冷眼眸甩不掉的悲伤 on 2020-02-25 13:18:25

Question


I am trying to scrape this website into a .CSV file and I am getting this error: AssertionError: 9 columns passed, passed data had 30 columns. My code is below; it is a little messy because I exported it from a Jupyter Notebook.

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://apps.azsos.gov/apps/election/cfs/search/CandidateSearch.aspx'

req = Request(url , headers={'User-Agent': 'Mozilla/5.0'})
html = urlopen(req).read()
soup = BeautifulSoup(html)

type(soup)  # we see that soup is a BeautifulSoup object

column_headers = [th.getText() for th in 
                  soup.findAll('tr', limit=2)[1].findAll('th')]
column_headers # our column headers

data_rows = soup.findAll('th')[2:]  # skip the first 2 header rows

type(data_rows)  # now we have a list of table rows

candidate_data = [[td.getText() for td in data_rows[i].findAll('td')]
            for i in range(len(data_rows))]

df = pd.DataFrame(candidate_data, columns=column_headers)
df.head()  # head() lets us see the 1st 5 rows of our DataFrame by default

df.to_csv(r'C:/Dev/Sheets/Candiate_Search.csv', encoding='utf-8', index=False)

Answer 1:


The data on the page definitely has a table, and you parse out the column headers and pass them to your CSV. Visually that table has 8 columns, but you parse 9 headers. At this point you should probably go check your data to see what you've found; it might not be what you expect. But okay, you go and check and you see that one of them is a spacer column in the table that will be empty or garbage, and you proceed.
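A quick way to do that kind of check is to compare the number of parsed headers with the number of cells in each parsed row before building the DataFrame. The sketch below runs on a small made-up HTML snippet, not the actual page, so the markup and the values in it are assumptions for illustration only.

```python
from collections import Counter
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page; the real markup will differ.
html = """
<table>
  <tr><th>Name</th><th></th><th>Office</th></tr>
  <tr><td>A. Smith</td><td></td><td>Governor</td></tr>
  <tr><td>B. Jones</td><td></td><td>Treasurer</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Headers as parsed, including the empty spacer header.
headers = [th.get_text(strip=True) for th in soup.find_all("th")]

# One list of cell texts per <tr>; rows without <td> cells come back empty.
rows = [[td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in soup.find_all("tr")]
rows = [r for r in rows if r]  # drop rows with no <td> cells, e.g. the header row

# Tally how many rows have each width; every width should equal len(headers).
width_counts = Counter(len(r) for r in rows)
print(len(headers), headers)
print(width_counts)
```

If the tally shows row widths that differ from the header count, pandas will raise the same kind of column-mismatch error when the DataFrame is built.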

These lines:

data_rows = soup.findAll('th')[2:]  # skip the first 2 header rows

type(data_rows)  # now we have a list of table rows

candidate_data = [[td.getText() for td in data_rows[i].findAll('td')]
        for i in range(len(data_rows))]

find every <th> instance in the page and then every <td> inside each <th>, and that's where it really goes off the rails. I am guessing you are not a web developer, but tables and their sub-elements (rows aka <tr>, headers aka <th>, and cells aka <td>) are used all over most pages for organizing tons of visual elements and also sometimes for organizing tabular data.

Guess what? You found a lot of tables that are not this visual table because you were searching the whole page for <th> elements.

I'd suggest you pre-filter down from using the entire soup by first finding a <table> or <div> that only contains the tabular data you're interested in, and then search within that scope.
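A minimal sketch of that scoping approach is shown here on a made-up page with two tables. The id "results" and all of the markup are hypothetical; on the real page you would need to inspect the HTML to find whatever id, class, or container actually wraps the data table.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical page: a layout table plus the data table we actually want.
html = """
<table id="layout"><tr><th>Menu</th></tr><tr><td>Home</td></tr></table>
<table id="results">
  <tr><th>Name</th><th>Office</th></tr>
  <tr><td>A. Smith</td><td>Governor</td></tr>
  <tr><td>B. Jones</td><td>Treasurer</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Narrow the scope first; "results" is an assumed id for illustration.
table = soup.find("table", id="results")

# Headers come only from the scoped table, not from the whole page.
column_headers = [th.get_text(strip=True) for th in table.find_all("th")]

# Iterate over rows (<tr>), not headers (<th>), and keep only rows with data cells.
candidate_data = []
for tr in table.find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:
        candidate_data.append(cells)

df = pd.DataFrame(candidate_data, columns=column_headers)
print(df)
```

Because both the headers and the rows are taken from the same table element, the header from the layout table never reaches the DataFrame, and each row has the same number of cells as there are column names.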



Source: https://stackoverflow.com/questions/59634423/beautiful-soup-assertionerror
