Question
I am trying to scrape this website into a .CSV file and I am getting an error that says: AssertionError: 9 columns passed, passed data had 30 columns. My code is below; it is a little messy because I exported it from a Jupyter Notebook.
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
import pandas as pd
url = 'https://apps.azsos.gov/apps/election/cfs/search/CandidateSearch.aspx'
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
html = urlopen(req).read()
soup = BeautifulSoup(html, 'html.parser')
type(soup) # we see that soup is a BeautifulSoup object
column_headers = [th.getText() for th in
soup.findAll('tr', limit=2)[1].findAll('th')]
column_headers # our column headers
data_rows = soup.findAll('th')[2:] # skip the first 2 header rows
type(data_rows) # now we have a list of table rows
candidate_data = [[td.getText() for td in data_rows[i].findAll('td')]
for i in range(len(data_rows))]
df = pd.DataFrame(candidate_data, columns=column_headers)
df.head() # head() lets us see the 1st 5 rows of our DataFrame by default
df.to_csv(r'C:/Dev/Sheets/Candiate_Search.csv', encoding='utf-8', index=False)
Answer 1:
The page definitely has a table, and you parse out the column headers and pass them to your DataFrame. Visually that table has 8 columns, but you parse 9 headers. At this point you should probably go check your data to see what you've found - it might not be what you expect. But okay, you go and check, you see that one of them is a spacer column in the table that will be empty or garbage, and you proceed.
These lines:
data_rows = soup.findAll('th')[2:] # skip the first 2 header rows
type(data_rows) # now we have a list of table rows
candidate_data = [[td.getText() for td in data_rows[i].findAll('td')]
for i in range(len(data_rows))]
find every <th> element in the page and then every <td> inside each <th>, and that's where it really goes off the rails. I am guessing you are not a web developer, but tables and their sub-elements (rows aka <tr>, header cells aka <th>, and data cells aka <td>) are used all over most pages for organizing tons of visual elements, and also sometimes for organizing tabular data.
Guess what? You found a lot of tables that are not this visual table because you were searching the whole page for <th> elements.
I'd suggest you pre-filter down from using the entire soup by first finding a <table> or <div> that only contains the tabular data you're interested in, and then search within that scope.
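A minimal sketch of that scoping idea, using a small inline HTML snippet instead of the real page (the table id `CandidateSearchResults` and the sample rows here are hypothetical - inspect the actual page source to find the real container of the results table):

```python
from bs4 import BeautifulSoup

# Two tables: a layout table we want to ignore, and the data table we want.
html = """
<table id="nav"><tr><th>Menu</th></tr></table>
<table id="CandidateSearchResults">
  <tr><th>Name</th><th>Office</th></tr>
  <tr><td>Jane Doe</td><td>Governor</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Scope the search to the one table we care about, then parse within it.
table = soup.find("table", id="CandidateSearchResults")
headers = [th.get_text() for th in table.find_all("th")]
rows = [[td.get_text() for td in tr.find_all("td")]
        for tr in table.find_all("tr")[1:]]  # skip the header row

print(headers)  # ['Name', 'Office']
print(rows)     # [['Jane Doe', 'Governor']]
```

Because every `find_all` call here runs on `table` rather than on the whole `soup`, the `<th>` in the layout table never shows up, and the row widths match the header count when you build the DataFrame.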
Source: https://stackoverflow.com/questions/59634423/beautiful-soup-assertionerror