How to pull links from within an 'a' tag

Submitted by 馋奶兔 on 2019-12-25 01:45:59

Question


I have tried several methods to pull links from the following webpage, but I can't find the ones I want. From this page (https://www.espn.com/college-football/scoreboard/_/year/2019/seasontype/2/week/1) I am trying to extract the link behind every "gamecast" button. For example, the first one I want is: https://www.espn.com/college-football/game/_/gameId/401110723

Even when I pull every link on the page, the ones I want are not among them, so I'm not sure where I'm going wrong. Below are a few attempts that don't return what I want. This is the first method I tried.

import requests
import csv
from bs4 import BeautifulSoup
import pandas as pd

page = requests.get('https://www.espn.com/college-football/scoreboard/_/year/2019/seasontype/2/week/1')
soup = BeautifulSoup(page.text, 'html.parser')
# game_id = soup.find(name_='&lpos=college-football:scoreboard:gamecast')
game_id = soup.find('a',class_='button-alt sm')

Here is a second method I tried. Any help is greatly appreciated.

# Only consider <a> tags that actually have an href attribute.
for a in soup.find_all('a', href=True):
    if 'college-football' in a['href']:
        print(a['href'])

Edit: to clarify, I am trying to pull all links that contain a gameId, as in the example link.


Answer 1:


The button that holds the link you want is rendered by JavaScript. The requests module only downloads the HTML; it does not execute JavaScript, so the button never appears in the text you are parsing. You therefore cannot scrape the button directly (unless you use a browser automation tool such as Selenium). However, the HTML does contain JSON data for the scoreboard, and that data includes the link. If you also want to scrape other information from this page (times, etc.), I highly recommend looking through the JSON stored in the variable json_scoreboard in the code.
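To illustrate the idea of reading JSON that is embedded in a script tag, here is a minimal sketch that uses only the standard library. SAMPLE_HTML is a made-up, heavily shortened stand-in for the page source, and extract_embedded_json is a hypothetical helper name; the real ESPN markup is much larger and may be structured differently. The sketch uses json.JSONDecoder.raw_decode instead of a regular expression, which is one possible way to parse a single JSON value and ignore whatever follows it.

```python
import json

# Hypothetical stand-in for the downloaded page source (assumption: the
# scoreboard JSON is assigned to window.espn.scoreboardData in a <script>).
SAMPLE_HTML = """
<html><head>
<script>window.espn.scoreboardData = {"events": [{"name": "Miami Hurricanes at Florida Gators", "links": [{"text": "Gamecast", "href": "http://www.espn.com/college-football/game/_/gameId/401110723"}]}]};window.espn.other = {};</script>
</head><body><div id="scoreboard"></div></body></html>
"""


def extract_embedded_json(html, marker):
    """Return the first JSON object that follows `marker` in `html`."""
    start = html.index(marker)      # raises ValueError if the marker is absent
    brace = html.index('{', start)  # first '{' after the marker
    # raw_decode parses exactly one JSON value starting at `brace` and
    # ignores any text that comes after it.
    obj, _end = json.JSONDecoder().raw_decode(html, brace)
    return obj


data = extract_embedded_json(SAMPLE_HTML, 'window.espn.scoreboardData')

# Map each event name to its Gamecast link, skipping events without one.
links = {
    event['name']: link['href']
    for event in data['events']
    for link in event['links']
    if link['text'] == 'Gamecast'
}
print(links)
```

Note that the sample page body contains no a tag at all, which mirrors why searching the parsed HTML for the button finds nothing.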

Code

import json
import re

import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.espn.com/college-football/scoreboard/_/year/2019/seasontype/2/week/1')
soup = BeautifulSoup(r.text, 'html.parser')

# The scoreboard data is embedded as JSON in a <script> tag inside <head>.
scripts_head = soup.find('head').find_all('script')
all_links = {}
for script in scripts_head:
    if 'window.espn.scoreboardData' in script.text:
        # Capture the first "{...};" in the script and parse it as JSON.
        json_scoreboard = json.loads(re.search(r'({.*?});', script.text).group(1))
        for event in json_scoreboard['events']:
            name = event['name']
            for link in event['links']:
                if link['text'] == 'Gamecast':
                    # Assign inside the match so that an event without a
                    # Gamecast link does not reuse the previous event's URL.
                    all_links[name] = link['href']

print(all_links)

Output

{'Miami Hurricanes at Florida Gators': 'http://www.espn.com/college-football/game/_/gameId/401110723', 'Georgia Tech Yellow Jackets at Clemson Tigers': 'http://www.espn.com/college-football/game/_/gameId/401111653', 'Texas State Bobcats at Texas A&M Aggies': 'http://www.espn.com/college-football/game/_/gameId/401110731', 'Utah Utes at BYU Cougars': 'http://www.espn.com/college-football/game/_/gameId/401114223', 'Florida A&M Rattlers at UCF Knights': 'http://www.espn.com/college-football/game/_/gameId/401117853', 'Tulsa Golden Hurricane at Michigan State Spartans': 'http://www.espn.com/college-football/game/_/gameId/401112212', 'Wisconsin Badgers at South Florida Bulls': 'http://www.espn.com/college-football/game/_/gameId/401117856', 'Duke Blue Devils at Alabama Crimson Tide': 'http://www.espn.com/college-football/game/_/gameId/401110720', 'Georgia Bulldogs at Vanderbilt Commodores': 'http://www.espn.com/college-football/game/_/gameId/401110732', 'Florida Atlantic Owls at Ohio State Buckeyes': 'http://www.espn.com/college-football/game/_/gameId/401112251', 'Georgia Southern Eagles at LSU Tigers': 'http://www.espn.com/college-football/game/_/gameId/401110725', 'Middle Tennessee Blue Raiders at Michigan Wolverines': 'http://www.espn.com/college-football/game/_/gameId/401112222', 'Louisiana Tech Bulldogs at Texas Longhorns': 'http://www.espn.com/college-football/game/_/gameId/401112135', 'Oregon Ducks at Auburn Tigers': 'http://www.espn.com/college-football/game/_/gameId/401110722', 'Eastern Washington Eagles at Washington Huskies': 'http://www.espn.com/college-football/game/_/gameId/401114233', 'Idaho Vandals at Penn State Nittany Lions': 'http://www.espn.com/college-football/game/_/gameId/401112257', 'Miami (OH) RedHawks at Iowa Hawkeyes': 'http://www.espn.com/college-football/game/_/gameId/401112191', 'Northern Iowa Panthers at Iowa State Cyclones': 'http://www.espn.com/college-football/game/_/gameId/401112085', 'Syracuse Orange at Liberty Flames': 
'http://www.espn.com/college-football/game/_/gameId/401112434', 'New Mexico State Aggies at Washington State Cougars': 'http://www.espn.com/college-football/game/_/gameId/401114228', 'South Alabama Jaguars at Nebraska Cornhuskers': 'http://www.espn.com/college-football/game/_/gameId/401112238', 'Northwestern Wildcats at Stanford Cardinal': 'http://www.espn.com/college-football/game/_/gameId/401112245', 'Houston Cougars at Oklahoma Sooners': 'http://www.espn.com/college-football/game/_/gameId/401112114', 'Notre Dame Fighting Irish at Louisville Cardinals': 'http://www.espn.com/college-football/game/_/gameId/401112436'}
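Since the question asks specifically for links that contain a gameId, the numeric ID can also be pulled out of each URL. The following is a small sketch, assuming every Gamecast URL has the /gameId/<digits> form seen in the output above; game_id_from_url and sample_links are illustrative names, and sample_links copies just two entries from that output.

```python
import re

# Assumption: the ID is the run of digits that follows "/gameId/".
GAME_ID_RE = re.compile(r'/gameId/(\d+)')


def game_id_from_url(url):
    """Return the gameId in `url` as a string, or None if there is none."""
    match = GAME_ID_RE.search(url)
    return match.group(1) if match else None


# Two entries copied from the output above.
sample_links = {
    'Miami Hurricanes at Florida Gators':
        'http://www.espn.com/college-football/game/_/gameId/401110723',
    'Georgia Tech Yellow Jackets at Clemson Tigers':
        'http://www.espn.com/college-football/game/_/gameId/401111653',
}

game_ids = {name: game_id_from_url(url) for name, url in sample_links.items()}
print(game_ids)
```

A URL without a gameId, such as the scoreboard page itself, returns None, so the same function can be used to filter a list of links.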


Source: https://stackoverflow.com/questions/58106281/how-to-pull-links-from-within-an-a-tag
