How do I avoid data from different tabs to be concatenated in one cell when I scrape a table?

问题

I scraped this page https://www.capfriendly.com/teams/bruins, specifically looking for the tables under the tab Cap Hit (Fowards, Defense, GoalTenders).

I used Python and BeautifulSoup4 and CSV as the output format.

import requests, bs4

r = requests.get('https://www.capfriendly.com/teams/bruins')
soup = bs4.BeautifulSoup(r.text, 'lxml')
table = soup.find(id="team")

with open("csvfile.csv", "w", newline='') as team_data: 
    for tr in table('tr', class_=['odd', 'even']): # get all tr whose class is odd or even 
        row = [td.text for td in tr('td')] # extract td's text 
        writer = csv.writer(team_data) 
        writer.writerow(row)

This is the output that I get:

['Krejci, David "A"', 'NMC', 'C', 'NHL', '30', '$7,250,000$7,250,000NMC', '$7,250,000$7,500,000NMC', '$7,250,000$7,500,000NMC', '$7,250,000$7,000,000Modified NTC', '$7,250,000$7,000,000Modified NTC', 'UFA', '']
['Bergeron, Patrice "A"', 'NMC', 'C', 'NHL', '31', '$6,875,000$8,750,000NMC', '$6,875,000$8,750,000NMC', '$6,875,000$6,875,000$6,000,000NMC', '$6,875,000$4,375,000$3,500,000NMC', '$6,875,000$4,375,000$1,000,000Modified NTC, NMC', '$6,875,000$4,375,000$1,000,000Modified NTC, NMC', 'UFA']
['Backes, David', 'NMC', 'C, RW', 'NHL', '32', '$6,000,000$8,000,000$3,000,000NMC', '$6,000,000$8,000,000$3,000,000NMC', '$6,000,000$6,000,000$3,000,000NMC', '$6,000,000$4,000,000$3,000,000Modified NTC', '$6,000,000$4,000,000$1,000,000Modified NTC', 'UFA', '']
['Marchand, Brad', 'M-NTC', 'LW', 'NHL', '28', '$4,500,000$5,000,000Modified NTC', '$6,125,000$8,000,000$4,000,000NMC', '$6,125,000$8,000,000$3,000,000NMC', '$6,125,000$7,500,000$4,000,000NMC', '$6,125,000$5,000,000$1,000,000NMC', '$6,125,000$6,500,000$4,000,000NMC', '$6,125,000$5,000,000$3,000,000Modified NTC']

As you can see data from different tabs is concatenated together:

'$7,250,000$7,000,000Modified NTC'

Somebody advised me to use javascript to scrape the table and that it should solve my problem?

回答1:

Based on the source code, this is some text in specific rows that is conditionally visible depending on what tab you're on (as your title states). The class .hide is added to the child element in the td when it is intended to be hidden on that specific tab.

When you're parsing the td elements to retreive the text, you could filter out those elements which are suppose to be hidden. In doing so, you can retrieve the text that would be visible as if you were viewing the page in a web browser.

In the snippet below, I added a parse_td function which filters out the children span elements with a class of hide. From there, the corresponding text is returned.

import requests, bs4, csv

r = requests.get('https://www.capfriendly.com/teams/bruins')
soup = bs4.BeautifulSoup(r.text, 'lxml')
table = soup.find(id="team")

with open("csvfile.csv", "w", newline='') as team_data: 
    def parse_td(td):
        filtered_data = [tag.text for tag in td.find_all('span', recursive=False)
                         if 'hide' not in tag.attrs['class']]
        return filtered_data[0] if filtered_data else td.text;

    for tr in table('tr', class_=['odd', 'even']):
        row = [parse_td(td) for td in tr('td')]
        writer = csv.writer(team_data)
        writer.writerow(row)

来源：https://stackoverflow.com/questions/42288444/how-do-i-avoid-data-from-different-tabs-to-be-concatenated-in-one-cell-when-i-sc

标签

javascript

python

web-scraping

bs4