{xml_nodeset (0)} issue when webscraping table

Question

I'm trying to scrape the first table from this url:

https://www.whoscored.com/Matches/318578/LiveStatistics/England-Premier-League-2009-2010-Blackburn-Arsenal

using the following code:

library(rvest)  # provides read_html() and html_nodes(), and re-exports %>%

url <- "https://www.whoscored.com/Matches/318578/LiveStatistics/England-Premier-League-2009-2010-Blackburn-Arsenal"
data <- url %>%
  read_html() %>%
  html_nodes(xpath='//*[@id="top-player-stats-summary-grid"]')

which gives data a value of {xml_nodeset (0)}

url <- "https://www.whoscored.com/Matches/318578/LiveStatistics/England-Premier-League-2009-2010-Blackburn-Arsenal"
data <- url %>%
  read_html() %>%
  html_nodes(css='.grid')

gives the same problem.

Apparently this might be a JavaScript issue. Is there a fast way to extract the relevant data? Inspecting the table entries suggests that the data is not imported from elsewhere but is coded into the page, so it seems I should be able to extract it from the source code. (Sorry, I am completely ignorant of how HTML and JS work, so my question might not make sense.)

Answer 1:

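The page fills in this content with JavaScript that runs in the browser after the page loads. rvest does not execute JavaScript, so the table is not present in the HTML it receives. You can, however, open the network tab of the browser dev tools, find the XHR call that returns this content as JSON, and request that endpoint directly with httr.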
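As a rough illustration of why the selector comes back empty, the sketch below parses a made-up HTML string in which the table has not yet been built by JavaScript. The markup, including the `statistics-container` id, is a hypothetical stand-in and not the actual source of the page; only the XPath is taken from the question.

```r
library(rvest)

# Hypothetical stand-in for what the server sends: an empty container
# that a browser would later fill in by running the page's JavaScript.
static_html <- '<html><body>
  <div id="statistics-container"></div>
  <script>/* a browser would build the table here */</script>
</body></html>'

page <- read_html(static_html)

# Same XPath as in the question; nothing matches because the node
# does not exist until the script has run.
nodes <- html_nodes(page, xpath = '//*[@id="top-player-stats-summary-grid"]')
print(length(nodes))
```

Since rvest only sees the markup as delivered, an empty node set here usually means the element is created client-side rather than that the selector is wrong.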

library(httr)
library(jsonlite)

# Headers copied from the browser's XHR request
headers <- c('user-agent' = 'Mozilla/5.0',
             'accept' = 'application/json, text/javascript, */*; q=0.01',
             'referer' = 'https://www.whoscored.com/Matches/318578/LiveStatistics/England-Premier-League-2009-2010-Blackburn-Arsenal',
             'authority' = 'www.whoscored.com',
             'x-requested-with' = 'XMLHttpRequest')

params <- list(
  'category' = 'summary',
  'subcategory' = 'all',
  'statsAccumulationType' = '0',
  'isCurrent' = 'true',
  'playerId' = '',
  'teamIds' = '158',
  'matchId' = '318578',
  'stageId' = '',
  'tournamentOptions' = '',
  'sortBy' = '',
  'sortAscending' = '',
  'age' = '',
  'ageComparisonType' = '',
  'appearances' = '',
  'appearancesComparisonType' = '',
  'field' = '',
  'nationality' = '',
  'positionOptions' = '',
  'timeOfTheGameEnd' = '',
  'timeOfTheGameStart' = '',
  'isMinApp' = '',
  'page' = '',
  'includeZeroValues' = '',
  'numberOfPlayersToPick' = ''
)

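# Request the JSON feed that the page itself fetches via XHR
r <- httr::GET(url = 'https://www.whoscored.com/StatisticsFeed/1/GetMatchCentrePlayerStatistics', httr::add_headers(.headers = headers), query = params)

# Parse the JSON body of the response
data <- jsonlite::fromJSON(httr::content(r, as = "text", encoding = "UTF-8"))
print(data$playerTableStats)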

The table contents are in data$playerTableStats, which you can inspect with View(data$playerTableStats). From there, parse it as required to get the information you want in the format you want.
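A minimal sketch of that last step, assuming the parsed result is a data frame of one row per player. The JSON below is invented for illustration, and the field names (`name`, `teamName`, `rating`, `shotsTotal`) are assumptions rather than the confirmed schema of the feed, so check names(data$playerTableStats) against the real response first.

```r
library(jsonlite)

# Made-up payload; the real feed's field names and values may differ.
sample_json <- '{"playerTableStats": [
  {"name": "Player A", "teamName": "Team X", "rating": 7.1, "shotsTotal": 2},
  {"name": "Player B", "teamName": "Team X", "rating": 6.4, "shotsTotal": 0}
]}'

parsed <- fromJSON(sample_json)

# fromJSON() simplifies an array of objects into a data frame
stats <- parsed$playerTableStats

# Keep only the columns of interest and sort by rating, highest first
wanted <- c("name", "rating", "shotsTotal")
summary_tbl <- stats[order(-stats$rating), wanted]
print(summary_tbl)
```

The same subsetting should work on the real data$playerTableStats once `wanted` is replaced with column names that actually appear in it.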

Source: https://stackoverflow.com/questions/57547825/xml-nodeset-0-issue-when-webscraping-table

Tags

web-scraping

rvest