Find cell in html table containing a specific icon

Submitted by 跟風遠走 on 2019-12-22 10:23:11

Question


I am looking for code that can inform me in which cell of an html table a particular icon resides. Here is what I am working with:

u <- "http://www.transfermarkt.nl/lionel-messi/leistungsdaten/spieler/28003/saison/2014/plus/1"
doc <- rvest::html(u)
tab <- rvest::html_table(doc, fill = TRUE)[[6]]

The column "Pos." designates the player's position in the field. Some of these have an additional icon. I can see the presence of these icons on the page as follows:

rvest::html_nodes(doc, ".kapitaenicon-table")

but this doesn't tell me WHERE they are. I would like my code to return that the icon occurs in rows 2, 10, 11, 27 of the "Pos." column in the table. How can I do that?


Answer 1:


A little bit more rvest and XPath magic can get you the indices:

library(rvest)
library(magrittr)
library(XML)

pg <- html("http://www.transfermarkt.nl/lionel-messi/leistungsdaten/spieler/28003/saison/2014/plus/1")

pg %>% 
  html_nodes("table") %>% 
  extract2(6) %>% 
  html_nodes("tbody > tr") %>% 
  sapply(function(x) {
    length(xpathSApply(x, "./td[8]/span[@class='kapitaenicon-table icons_sprite']")) == 1
  }) %>% which

## [1]  2 10 11 27

That takes the 6th table, extracts its tr elements, and checks each one for an 8th td containing a span with the right class. When the XPath search finds nothing it returns an empty list, so the length of the result tells you which rows have the icon in that cell and which do not.
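
The snippet above mixes rvest with the XML package's xpathSApply(), which depends on the package versions in use at the time. As a rough sketch of the same row-wise test using xml2 alone, the example below runs against a small made-up table rather than the real page. The HTML string, the two-column layout and the td[2] index are assumptions for illustration only.

```r
library(xml2)

# Made-up stand-in for the real page: three rows, with the icon span
# placed in the second cell of rows 1 and 3 (illustrative data only).
html <- '
<table>
  <tbody>
    <tr><td>A</td><td>CF <span class="kapitaenicon-table icons_sprite"></span></td></tr>
    <tr><td>B</td><td>RW</td></tr>
    <tr><td>C</td><td>CF <span class="kapitaenicon-table icons_sprite"></span></td></tr>
  </tbody>
</table>'

doc  <- read_html(html)
rows <- xml_find_all(doc, "//table/tbody/tr")

# For each row, search the 2nd cell for the icon span.
# An empty result (length 0) means the row has no icon.
has_icon <- vapply(rows, function(row) {
  hits <- xml_find_all(row, "./td[2]/span[@class='kapitaenicon-table icons_sprite']")
  length(hits) > 0
}, logical(1))

icon_rows <- which(has_icon)
print(icon_rows)
```

Iterating over the rows (rather than over the cells) keeps the result aligned with row numbers even if some row lacks the cell being tested.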

This:

pg %>% 
  html_nodes(xpath="//table[6]/tbody/tr/td[8]") %>% 
  xmlSApply(xpathApply, "boolean(./span[@class='kapitaenicon-table icons_sprite'])") %>% 
  which

also works and is a bit tighter (and faster). It uses the XPath boolean() function to test for existence, which is handier if you have no other operations to perform on the node(s).
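
For comparison, here is a minimal sketch of the boolean() approach with xml2's xml_find_lgl(), again on a made-up table instead of the live page. The HTML, the td[2] index and the looser class test (which matches the kapitaenicon-table token regardless of other classes on the span) are assumptions, not part of the original answer.

```r
library(xml2)

# Illustrative data only: the icon sits in the second cell of rows 2 and 3.
html <- '
<table>
  <tbody>
    <tr><td>A</td><td>RW</td></tr>
    <tr><td>B</td><td>CF <span class="kapitaenicon-table icons_sprite"></span></td></tr>
    <tr><td>C</td><td>CF <span class="icons_sprite kapitaenicon-table"></span></td></tr>
  </tbody>
</table>'

doc   <- read_html(html)
cells <- xml_find_all(doc, "//table/tbody/tr/td[2]")

# boolean() turns "is there a matching span?" into TRUE/FALSE.
# The concat/normalize-space idiom matches one class token among several.
icon_test <- paste0(
  "boolean(./span[contains(concat(' ', normalize-space(@class), ' '), ",
  "' kapitaenicon-table ')])"
)

has_icon <- vapply(cells, function(cell) xml_find_lgl(cell, icon_test), logical(1))

icon_rows <- which(has_icon)
print(icon_rows)
```

Because this version indexes cells rather than rows, the positions only correspond to row numbers when every row actually has the cell being selected.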

This is an xml2 version, though I have to believe there has to be a better way to do this in xml2:

library(xml2)
library(magrittr)

pg2 <- read_html("http://www.transfermarkt.nl/lionel-messi/leistungsdaten/spieler/28003/saison/2014/plus/1")
pg2 %>% 
  xml_find_all("//table[6]/tbody/tr/td[8]") %>% 
  as_list %>% 
  sapply(function(x) {
    inherits(try(xml_find_one(x, "./span"), silent=TRUE), "xml_node")
  }) %>% which
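
In later xml2 releases xml_find_one() gave way to xml_find_first(), which returns a missing node instead of raising an error when nothing matches, so the try() wrapper may no longer be needed. The sketch below assumes a recent xml2 and uses a made-up table; it relies on xml_text() yielding NA for a missing node.

```r
library(xml2)

# Illustrative data only: a span is present in the second cell of row 2.
html <- '
<table>
  <tbody>
    <tr><td>A</td><td>RW</td></tr>
    <tr><td>B</td><td>CF <span class="kapitaenicon-table icons_sprite"></span></td></tr>
    <tr><td>C</td><td>LW</td></tr>
  </tbody>
</table>'

doc   <- read_html(html)
cells <- xml_find_all(doc, "//table/tbody/tr/td[2]")

# xml_find_first() returns one result per cell; cells without a span
# yield a missing node, whose text is NA (an empty span gives "").
first_span <- xml_find_first(cells, "./span")
has_span   <- !is.na(xml_text(first_span))

span_rows <- which(has_span)
print(span_rows)
```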

UPDATE

For version 0.1.0.9000 of xml2 I had to do the following:

pg2 %>% xml_find_all("//table") %>% 
  as_list %>% 
  extract2(6) %>% 
  xml_find_all("./tbody/tr/td[8]") %>% 
  as_list %>% 
  sapply(function(x) {
    inherits(try(xml_find_one(x, "./span"), silent=TRUE), "xml_node")
  }) %>% which

That should not be the case and I've filed a bug report.
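
One XPath detail is worth keeping in mind when picking a table by position: //table[n] selects every table that is the n-th table child of its own parent, while (//table)[n] selects the n-th table in document order. Whether that distinction has anything to do with the behaviour described above is not established here; the sketch below only demonstrates the difference on made-up HTML with hypothetical ids.

```r
library(xml2)

# Three tables spread over two parents (ids are hypothetical).
html <- '
<div>
  <table id="t1"><tr><td>a</td></tr></table>
  <table id="t2"><tr><td>b</td></tr></table>
</div>
<div>
  <table id="t3"><tr><td>c</td></tr></table>
</div>'

doc <- read_html(html)

# Second <table> child within its own parent: only the first div has one.
by_parent <- xml_attr(xml_find_all(doc, "//table[2]"), "id")

# Third table counted across the whole document.
by_document <- xml_attr(xml_find_all(doc, "(//table)[3]"), "id")

# No parent holds three tables, so this matches nothing.
none <- xml_find_all(doc, "//table[3]")

print(by_parent)
print(by_document)
print(length(none))
```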

Session info -------------------------------------------------------------------------
 setting  value                       
 version  R version 3.2.0 (2015-04-16)
 system   x86_64, darwin13.4.0        
 ui       RStudio (0.99.441)          
 language (EN)                        
 collate  en_US.UTF-8                 
 tz       America/New_York            

Packages -----------------------------------------------------------------------------
 package    * version date       source        
 curl       * 0.5     2015-02-01 CRAN (R 3.2.0)
 devtools   * 1.7.0   2015-01-17 CRAN (R 3.2.0)
 magrittr     1.5     2014-11-22 CRAN (R 3.2.0)
 Rcpp       * 0.11.5  2015-03-06 CRAN (R 3.2.0)
 rstudioapi * 0.3.1   2015-04-07 CRAN (R 3.2.0)
 xml2         0.1.0   2015-04-20 CRAN (R 3.2.0)


Source: https://stackoverflow.com/questions/30556130/find-cell-in-html-table-containing-a-specific-icon
