Question
For this project, I am scraping data from a database and attempting to export this data to a spreadsheet for further analysis. (Previously posted here--thanks for the help over there reworking my code!)
I previously thought that finding the winning candidate in the table could be simplified by just always selecting the first name that appears in the table, as I thought the "winners" always appeared first. However, this is not the case.
Whether or not a candidate was elected is stored in the form of a picture in the first column. How would I scrape this and store it in a spreadsheet?
It's located inside a <td headers> cell as:
<img src="/WPAPPS/WPR/Content/Images/selected_box.gif" alt="contestant won this nomination contest">
My question is: how would I use BeautifulSoup to parse the HTML table and extract a value from the first column, which is stored in the table as an image rather than text?
I had an idea for some sort of Boolean flag (elected or not), but I am unsure how to implement it.
My code is below:
from bs4 import BeautifulSoup
import requests
import re
import csv

url = "http://www.elections.ca/WPAPPS/WPR/EN/NC?province=-1&distyear=2013&district=-1&party=-1&pageno={}&totalpages=55&totalcount=1368&secondaryaction=prev25"
rows = []

for i in range(1, 56):
    print(i)
    r = requests.get(url.format(i))
    data = r.text
    cat = BeautifulSoup(data, "html.parser")
    links = []

    for link in cat.find_all('a', href=re.compile('selectedid=')):
        links.append("http://www.elections.ca" + link.get('href'))

    for link in links:
        r = requests.get(link)
        data = r.text
        cat = BeautifulSoup(data, "html.parser")
        lspans = cat.find_all('span')
        cs = cat.find_all("table")[0].find_all("td", headers="name/1")
        elected = []

        for c in cs:
            elected.append(c.contents[0].strip())

        rows.append([
            lspans[2].contents[0],
            lspans[3].contents[0],
            lspans[5].contents[0],
            re.sub("[\n\r/]", "", cat.find("legend").contents[2]).strip(),
            re.sub("[\n\r/]", "", cat.find_all('div', class_="group")[2].contents[2]).strip().encode('latin-1'),
            len(elected),
            cs[0].contents[0].strip().encode('latin-1')
        ])

with open('filename.csv', 'w', newline='') as f_output:
    csv_output = csv.writer(f_output)
    csv_output.writerows(rows)
Really--any tips would be GREATLY appreciated. Thanks a lot.
Answer 1:
This snippet will print the name of the elected person:
from bs4 import BeautifulSoup
import requests

req = requests.get("http://www.elections.ca/WPAPPS/WPR/EN/NC/Details?province=-1&distyear=2013&district=-1&party=-1&selectedid=8548")
page_source = BeautifulSoup(req.text, "html.parser")
table = page_source.find("table", {"id": "gvContestants/1"})

for row in table.find_all("tr"):
    # Rows without an image cannot be the winner's row.
    if not row.find("img"):
        continue
    # The winner's row is the one whose image is selected_box.gif.
    if "selected_box.gif" in row.find("img").get("src"):
        # Collapse runs of whitespace, keeping single spaces between the parts of the name.
        print(' '.join(row.find("td", {"headers": "name/1"}).text.split()))
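If you want the Boolean you mentioned rather than just a printed name, the same check can be wrapped in a function that returns a (name, won) pair for every contestant row. This is only a sketch: `SAMPLE_HTML` and the names in it are made up, modeled on the markup quoted in the question, and `contestant_results` is a hypothetical helper, so the real page may need adjustments. It compares the image file name exactly instead of using a substring test, so a different image whose name merely contains "selected_box.gif" would not count as a win.

```python
from bs4 import BeautifulSoup

# Made-up sample modeled on the markup quoted in the question.
# The real page may be structured differently.
SAMPLE_HTML = """
<table id="gvContestants/1">
  <tr><th>Elected</th><th id="name/1">Name</th></tr>
  <tr>
    <td></td>
    <td headers="name/1">
      Doe, Jane
    </td>
  </tr>
  <tr>
    <td><img src="/WPAPPS/WPR/Content/Images/selected_box.gif"
             alt="contestant won this nomination contest"></td>
    <td headers="name/1">Roe, Richard</td>
  </tr>
</table>
"""


def contestant_results(html):
    """Return a list of (name, won) tuples, one per contestant row."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for row in soup.find_all("tr"):
        name_cell = row.find("td", headers="name/1")
        if name_cell is None:
            continue  # header row, or a row without a name cell
        image = row.find("img")
        # Take the file name at the end of the src path, if there is an image.
        file_name = image.get("src", "").rsplit("/", 1)[-1] if image else ""
        won = file_name == "selected_box.gif"
        # Collapse newlines and repeated spaces inside the name cell.
        name = " ".join(name_cell.get_text().split())
        results.append((name, won))
    return results


for name, won in contestant_results(SAMPLE_HTML):
    print(name, won)
```

In the loop from the question you would call it with the page text of each details page (`r.text`) instead of the sample.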
As a side note, please refrain from declaring variables with meaningless names. It hurts the eyes of anyone trying to help you, and it will hurt you in the future when you look at the code again.
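For the spreadsheet part of the question, one option is to store the result as a yes/no column next to each name. The sketch below uses only the standard library; the `contestants` list is made-up data standing in for scraped values, `write_contestants` is a hypothetical helper, and the `io.StringIO` buffer stands in for the `open('filename.csv', 'w', newline='')` call from the question.

```python
import csv
import io

# Made-up (name, won) pairs standing in for values scraped from one page.
contestants = [("Doe, Jane", False), ("Roe, Richard", True)]


def write_contestants(contestants, file_obj):
    """Write a header plus one CSV row per contestant with a yes/no 'elected' column."""
    writer = csv.writer(file_obj)
    writer.writerow(["name", "elected"])
    for name, won in contestants:
        writer.writerow([name, "yes" if won else "no"])


buffer = io.StringIO()  # in-memory stand-in for the real output file
write_contestants(contestants, buffer)
print(buffer.getvalue())
```

The csv module quotes names that contain a comma, so "Doe, Jane" stays in a single cell when the file is opened in a spreadsheet.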
Source: https://stackoverflow.com/questions/39771874/scraping-add-data-stored-as-a-picture-to-csv-file-in-python-3-5