Scraping: add data stored as a picture to a CSV file in Python 3.5

让人想犯罪 __ submitted on 2019-12-22 17:46:52

Question


For this project, I am scraping data from a database and attempting to export this data to a spreadsheet for further analysis. (Previously posted here--thanks for the help over there reworking my code!)

I previously thought that finding the winning candidate in the table could be simplified by just always selecting the first name that appears in the table, as I thought the "winners" always appeared first. However, this is not the case.

Whether or not a candidate was elected is stored in the form of a picture in the first column. How would I scrape this and store it in a spreadsheet?

It sits inside a <td> cell (one that has a headers attribute) as:

<img src="/WPAPPS/WPR/Content/Images/selected_box.gif" alt="contestant won this nomination contest">

My question is: how would I use BeautifulSoup to parse the HTML table and extract a value from the first column, which is stored in the table as an image rather than text?

I had an idea of using some sort of Boolean flag for this, but I am unsure how to implement it.

My code is below:

from bs4 import BeautifulSoup
import requests
import re
import csv


url = "http://www.elections.ca/WPAPPS/WPR/EN/NC?province=-1&distyear=2013&district=-1&party=-1&pageno={}&totalpages=55&totalcount=1368&secondaryaction=prev25"
rows = []

for i in range(1, 56):
    print(i)
    r  = requests.get(url.format(i))
    data = r.text
    cat = BeautifulSoup(data, "html.parser")
    links = []

    for link in cat.find_all('a', href=re.compile('selectedid=')):
        links.append("http://www.elections.ca" + link.get('href'))  

    for link in links:
        r  = requests.get(link)
        data = r.text
        cat = BeautifulSoup(data, "html.parser")
        lspans = cat.find_all('span')
        cs = cat.find_all("table")[0].find_all("td", headers="name/1")        
        elected = []

        for c in cs:
            elected.append(c.contents[0].strip())

        rows.append([
            lspans[2].contents[0], 
            lspans[3].contents[0], 
            lspans[5].contents[0],
            re.sub("[\n\r/]", "", cat.find("legend").contents[2]).strip(),
            re.sub("[\n\r/]", "",  cat.find_all('div', class_="group")[2].contents[2]).strip().encode('latin-1'),
            len(elected),
            cs[0].contents[0].strip().encode('latin-1')
            ])

with open('filename.csv', 'w', newline='') as f_output:
   csv_output = csv.writer(f_output)
   csv_output.writerows(rows)

Really--any tips would be GREATLY appreciated. Thanks a lot.


Answer 1:


This snippet will print the name of the elected person:

from bs4 import BeautifulSoup
import requests

# Fetch one contest's details page and parse it.
response = requests.get("http://www.elections.ca/WPAPPS/WPR/EN/NC/Details?province=-1&distyear=2013&district=-1&party=-1&selectedid=8548")
page_source = BeautifulSoup(response.text, "html.parser")
table = page_source.find("table", {"id": "gvContestants/1"})
for row in table.find_all("tr"):
    image = row.find("img")
    # Skip rows that contain no image at all.
    if image is None:
        continue
    # The winner's row is marked with selected_box.gif.
    if "selected_box.gif" in image.get("src", ""):
        # Collapse runs of whitespace in the name cell to single spaces.
        print(' '.join(row.find("td", {"headers": "name/1"}).text.split()))
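
Since the goal is a spreadsheet column rather than printed output, the same image check can be turned into a true/false value for each contestant. The following is only a sketch under stated assumptions: the sample markup is hypothetical (modeled on the <img> tag quoted in the question, with an invented "empty_box.gif" for non-winners), and the helper name contestants_with_flag is made up for illustration. It compares the image's file name exactly, so a differently named file that merely contains "selected_box.gif" would not be counted as a win.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the real page's table may be laid out differently.
SAMPLE_HTML = """
<table id="gvContestants/1">
  <tr><th id="elected/1">Elected</th><th id="name/1">Name</th></tr>
  <tr>
    <td headers="elected/1"><img src="/WPAPPS/WPR/Content/Images/selected_box.gif" alt="contestant won this nomination contest"></td>
    <td headers="name/1"> Jane   Doe </td>
  </tr>
  <tr>
    <td headers="elected/1"><img src="/WPAPPS/WPR/Content/Images/empty_box.gif" alt=""></td>
    <td headers="name/1">John Roe</td>
  </tr>
</table>
"""


def contestants_with_flag(html):
    """Return a list of (name, won) tuples, one per contestant row."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"id": "gvContestants/1"})
    results = []
    for row in table.find_all("tr"):
        name_cell = row.find("td", {"headers": "name/1"})
        if name_cell is None:
            continue  # no name cell in this row (e.g. the header row)
        image = row.find("img")
        # Compare only the file name at the end of the src path.
        file_name = image.get("src", "").rsplit("/", 1)[-1] if image else ""
        won = file_name == "selected_box.gif"
        # Collapse runs of whitespace in the name to single spaces.
        name = " ".join(name_cell.get_text().split())
        results.append((name, won))
    return results


contestants = contestants_with_flag(SAMPLE_HTML)
```

Each tuple can then be appended to the rows list from the question in place of the "first name in the table" shortcut.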

As a side note, please refrain from giving variables meaningless names. It hurts the eyes of anyone trying to help you, and it will hurt you in the future when you look at the code again.
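
To get the result into the spreadsheet the question asks about, the win/lose value can be written as its own CSV column. This is a rough sketch, not the asker's final script: the rows, the column names, and the helper write_results are all made up for illustration, and an in-memory buffer stands in for the real file. Note that in Python 3 csv.writer expects str values, so calling .encode('latin-1') on a field, as the question's code does, would put a b'...' literal in the file; the encoding is better passed to open() instead.

```python
import csv
import io


def write_results(rows, file_obj):
    """Write (name, won) pairs as CSV with a header line."""
    writer = csv.writer(file_obj)
    writer.writerow(["name", "elected"])
    for name, won in rows:
        # Store the Boolean as plain text so it is easy to filter in a spreadsheet.
        writer.writerow([name, "yes" if won else "no"])


# Illustrative data only; real rows would come from the scraped pages.
sample_rows = [("Jane Doe", True), ("John Roe", False)]

# An in-memory buffer keeps the sketch self-contained. For a real file, use
# open('filename.csv', 'w', newline='', encoding='latin-1') instead.
buffer = io.StringIO(newline="")
write_results(sample_rows, buffer)
csv_text = buffer.getvalue()
```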



Source: https://stackoverflow.com/questions/39771874/scraping-add-data-stored-as-a-picture-to-csv-file-in-python-3-5
