{xml_nodeset (0)} issue when webscraping table

Submitted by 余生长醉 on 2019-12-24 11:17:06

Question


I'm trying to scrape the first table from this url:

https://www.whoscored.com/Matches/318578/LiveStatistics/England-Premier-League-2009-2010-Blackburn-Arsenal

using the following code:

library(rvest)  # provides read_html(), html_nodes() and the %>% pipe

url <- "https://www.whoscored.com/Matches/318578/LiveStatistics/England-Premier-League-2009-2010-Blackburn-Arsenal"
data <- url %>%
  read_html() %>%
  html_nodes(xpath='//*[@id="top-player-stats-summary-grid"]')

which gives data the value {xml_nodeset (0)}. Using a CSS selector instead:

url <- "https://www.whoscored.com/Matches/318578/LiveStatistics/England-Premier-League-2009-2010-Blackburn-Arsenal"
data <- url %>%
  read_html() %>%
  html_nodes(css='.grid')

gives the same problem.

Apparently this might be a JavaScript issue. Is there a fast way to extract the relevant data? Inspecting the table entries in the browser suggests that the data is not imported from elsewhere but is coded into the page, so it seems I should be able to extract it from the source code (sorry, I am completely ignorant of how HTML and JS work, so my question might not make sense).


Answer 1:


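In a browser, the page fills in this content dynamically with JavaScript that runs after the page loads. rvest only downloads and parses the HTML source and does not run scripts, so the table is never created and the selectors match nothing. In the browser's dev tools, however, the Network tab shows the XHR call that returns this content as JSON, and you can request that endpoint directly with httr.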
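To make the empty node set less mysterious, here is a minimal sketch that reproduces it offline. The HTML string is invented for illustration (only the table id comes from the question), and it assumes the real page works in a similar way: an empty container in the source and a script that builds the table in the browser.

library(rvest)

# Hypothetical page source: the container is empty in the HTML itself,
# and the script would create the table only when a browser runs it.
html <- '
<html><body>
  <div id="stats-container"></div>
  <script>
    var t = document.createElement("table");
    t.id = "top-player-stats-summary-grid";
    document.getElementById("stats-container").appendChild(t);
  </script>
</body></html>'

page <- read_html(html)

# The empty container is part of the source, so it is found
container <- html_nodes(page, xpath = '//*[@id="stats-container"]')

# The table exists only after the script runs, so rvest finds nothing
table_nodes <- html_nodes(page, xpath = '//*[@id="top-player-stats-summary-grid"]')

length(container)    # 1
length(table_nodes)  # 0, printed as {xml_nodeset (0)}

The httr request that follows avoids the problem by asking the JSON endpoint for the data instead of looking for the table in the HTML.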

library(httr)
library(jsonlite)

# Request headers copied from the XHR call shown in the browser's Network tab
headers <- c('user-agent' = 'Mozilla/5.0',
             'accept' = 'application/json, text/javascript, */*; q=0.01',
             'referer' = 'https://www.whoscored.com/Matches/318578/LiveStatistics/England-Premier-League-2009-2010-Blackburn-Arsenal',
             'authority' = 'www.whoscored.com',
             'x-requested-with' = 'XMLHttpRequest')

# Query string parameters of the same XHR call
params <- list(
  'category' = 'summary',
  'subcategory' = 'all',
  'statsAccumulationType' = '0',
  'isCurrent' = 'true',
  'playerId' = '',
  'teamIds' = '158',
  'matchId' = '318578',
  'stageId' = '',
  'tournamentOptions' = '',
  'sortBy' = '',
  'sortAscending' = '',
  'age' = '',
  'ageComparisonType' = '',
  'appearances' = '',
  'appearancesComparisonType' = '',
  'field' = '',
  'nationality' = '',
  'positionOptions' = '',
  'timeOfTheGameEnd' = '',
  'timeOfTheGameStart' = '',
  'isMinApp' = '',
  'page' = '',
  'includeZeroValues' = '',
  'numberOfPlayersToPick' = ''
)

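r <- httr::GET(url = 'https://www.whoscored.com/StatisticsFeed/1/GetMatchCentrePlayerStatistics', httr::add_headers(.headers = headers), query = params)
httr::stop_for_status(r)  # fail early on an HTTP error

# Parse the JSON body; the player table is in data$playerTableStats
data <- jsonlite::fromJSON(httr::content(r, as = "text", encoding = "UTF-8"))
print(data$playerTableStats)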

data$playerTableStats holds the table contents, and you can inspect it with View(data$playerTableStats). From there, parse out the information you want in the format you want.
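As a rough illustration of that last step, the sketch below parses a small made-up payload and keeps only a few columns. The field names (name, rating, shotsTotal) are placeholders and not the real response schema; run names(data$playerTableStats) on the actual response to see which columns are available.

library(jsonlite)

# Hypothetical payload: these field names are assumptions for illustration
json_text <- '{
  "playerTableStats": [
    {"name": "Player A", "rating": 7.1, "shotsTotal": 2},
    {"name": "Player B", "rating": 6.4, "shotsTotal": 0}
  ]
}'

data <- fromJSON(json_text)

# fromJSON() simplifies an array of objects into a data frame, one row per player
stats <- data$playerTableStats

# Keep only the columns of interest that are actually present
wanted <- c("name", "rating")
subset_stats <- stats[, intersect(wanted, names(stats)), drop = FALSE]
print(subset_stats)

Using intersect() keeps the selection from failing if one of the expected columns is missing from the response.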



Source: https://stackoverflow.com/questions/57547825/xml-nodeset-0-issue-when-webscraping-table
