Use rvest and a CSS selector to extract a table from scraped search results

Posted by 一世执手 on 2019-12-13 16:08:07

Question


I just learned about rvest from Hadley's great webinar and am trying it out for the first time.

I want to scrape (and then plot) the baseball standings table returned from a Google search result.

My problem is that rvest does not return the table I see in my browser plug-in.

library(rvest)
library(httr)     # for add_headers()
library(magrittr) # for %>% operator

( g_search <- html_session(url = "http://www.google.com/?q=mlb+standings",
                           add_headers("user-agent" = "Mozilla/5.0")) )
# <session> http://www.google.com/?q=mlb+standings
#   Status: 200
#   Type:   text/html; charset=UTF-8
#   Size:   52500

This search should return a page with a table buried under many layers but uniquely identified by <div class="tb_strip">. A quick stop at the CSS Diner teaches me (I think) that "div.tb_strip" is a valid CSS selector to capture this table (and possibly other junk). And indeed, using Firebug's CSS selector, I see the full path:

# Use Firebug "Copy CSS Path" and paste into table_path
table_path <- "html body#gsr.srp.tbo.vasq div#main div#cnt.big div.mw div#rcnt div.col div#center_col div#res.med div#search div div#ires ol#rso li.g.tpo.knavi.obcontainer div.kp-blk div#uid_0.r-iCGI_bFBahQE.xpdbox.xpdopen div div.lr_container.mod div#lr_tab_unit_uid_1.tb_u.r-igQv_rxlT08k div.tb_view div.tb_strip"

However, the following attempt to access this table fails due to html_nodes returning an empty list.

( standings <- g_search %>% 
    html_nodes("div.tb_strip") %>% 
    html_table() 
  ) #returns empty list
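
To separate "the selector is wrong" from "the content is not in the downloaded page", it can help to try the same selector on markup that is known to contain the node. The following is only a minimal sketch: the inline HTML is a hypothetical stand-in for the search result page (its team names and numbers are made up), and it assumes the standings sit in a <table> inside div.tb_strip.

```r
library(rvest)

# Hypothetical static markup standing in for the real page (assumption).
page <- read_html('
<html><body>
  <div class="tb_strip">
    <table>
      <tr><th>Team</th><th>W</th><th>L</th></tr>
      <tr><td>A</td><td>10</td><td>5</td></tr>
      <tr><td>B</td><td>8</td><td>7</td></tr>
    </table>
  </div>
</body></html>')

# Select <table> elements nested under div.tb_strip.
nodes <- html_nodes(page, "div.tb_strip table")

# html_table() on a node set returns a list, one data frame per table.
standings <- html_table(nodes)[[1]]
```

If the selector matches here but not on the live page, the node is most likely absent from the HTML that rvest downloaded.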

The content does not seem to be making it into g_search, so I don't know yet whether the CSS selector worked.

grep("tb_strip",html_text(read_html("http://www.google.com/?q=mlb+standings")) ) # empty
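
One caveat about this check: html_text() returns only the text content of the document, so a class name such as tb_strip would not show up in it even if the node were present. A sketch of a check against the serialized markup instead, using a hypothetical inline document rather than the real page:

```r
library(rvest)

# Hypothetical one-node document (assumption, not the real search page).
page <- read_html('<html><body><div class="tb_strip">Standings</div></body></html>')

# Text content only: attribute values such as class names are dropped.
in_text <- grepl("tb_strip", html_text(page), fixed = TRUE)

# Serialized markup: tags and attributes are included.
in_markup <- grepl("tb_strip", as.character(page), fixed = TRUE)
```

Running the as.character() variant against the downloaded page would show whether the div is really missing from the response.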

Where did it go?

TYVM


Answer 1:


Here's an example from an easier site...

library("rvest")
url <- "http://sports.yahoo.com/mlb/standings/"
read_html(url) %>% html_nodes(".yui3-tabview-content") %>% html_nodes("table") %>% html_table()


Source: https://stackoverflow.com/questions/30850542/use-rvest-and-css-selector-to-extract-table-from-scraped-search-results
