Question
I am trying to scrape this website into a .CSV file and I am getting an error that says: AssertionError: 9 columns passed, passed data had 30 columns. My code is below; it is a little messy because I exported it from a Jupyter Notebook.
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
import pandas as pd
url = 'https://apps.azsos.gov/apps/election/cfs/search/CandidateSearch.aspx'
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
html = urlopen(req).read()
soup = BeautifulSoup(html, 'html.parser')
type(soup) # we see that soup is a BeautifulSoup object
column_headers = [th.getText() for th in
soup.findAll('tr', limit=2)[1].findAll('th')]
column_headers # our column headers
data_rows = soup.findAll('th')[2:] # skip the first 2 header rows
type(data_rows) # now we have a list of table rows
candidate_data = [[td.getText() for td in data_rows[i].findAll('td')]
for i in range(len(data_rows))]
df = pd.DataFrame(candidate_data, columns=column_headers)
df.head() # head() lets us see the 1st 5 rows of our DataFrame by default
df.to_csv(r'C:/Dev/Sheets/Candiate_Search.csv', encoding='utf-8', index=False)
Answer 1:
The page definitely has a table, and you parse out the column headers and pass them to your DataFrame. Visually that table has 8 columns, but you parse 9 headers. At this point you should probably go check your data to see what you've found - it might not be what you expect. But okay, you go and check, you see that one of them is a spacer column in the table that will be empty or garbage, and you proceed.
These lines:
data_rows = soup.findAll('th')[2:] # skip the first 2 header rows
type(data_rows) # now we have a list of table rows
candidate_data = [[td.getText() for td in data_rows[i].findAll('td')]
for i in range(len(data_rows))]
find every <th> element in the page and then every <td> inside each <th>, and that's where it really goes off the rails. I am guessing you are not a web developer, but tables and their sub-elements (rows aka <tr>, header cells aka <th>, and data cells aka <td>) are used all over most pages for organizing tons of visual elements, and also sometimes for organizing tabular data.
Guess what? You found a lot of tables that are not this visual table because you were searching the whole page for <th> elements.
I'd suggest you pre-filter down from using the entire soup by first finding a <table> or <div> that only contains the tabular data you're interested in, and then search within that scope.
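A minimal sketch of that scoping idea, using a small inline HTML snippet instead of the real page (the table id `CandidateSearchResults` and the sample rows here are hypothetical - inspect the actual page source to find the real container of the results table):

```python
from bs4 import BeautifulSoup

# Two tables: a layout table we want to ignore, and the data table we want.
html = """
<table id="nav"><tr><th>Menu</th></tr></table>
<table id="CandidateSearchResults">
  <tr><th>Name</th><th>Office</th></tr>
  <tr><td>Jane Doe</td><td>Governor</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Scope the search to the one table we care about, then parse within it.
table = soup.find("table", id="CandidateSearchResults")
headers = [th.get_text() for th in table.find_all("th")]
rows = [[td.get_text() for td in tr.find_all("td")]
        for tr in table.find_all("tr")[1:]]  # skip the header row

print(headers)  # ['Name', 'Office']
print(rows)     # [['Jane Doe', 'Governor']]
```

Because every `find_all` call here runs on `table` rather than on the whole `soup`, the `<th>` in the layout table never shows up, and the row widths match the header count when you build the DataFrame.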
Source: https://stackoverflow.com/questions/59634423/beautiful-soup-assertionerror