Question
I am looking for code that can tell me which cell of an HTML table contains a particular icon. Here is what I am working with:
u <- "http://www.transfermarkt.nl/lionel-messi/leistungsdaten/spieler/28003/saison/2014/plus/1"
doc <- rvest::html(u)
tab <- rvest::html_table(doc, fill = TRUE)[[6]]
The column "Pos." designates the player's position in the field. Some of these have an additional icon. I can see the presence of these icons on the page as follows:
rvest::html_nodes(doc, ".kapitaenicon-table")
but this doesn't tell me WHERE they are. I would like my code to return that the icon occurs in rows 2, 10, 11, 27 of the "Pos." column in the table. How can I do that?
Answer 1:
A little bit more rvest and XPath magic can get you the indices:
library(rvest)
library(magrittr)
library(XML)
pg <- html("http://www.transfermarkt.nl/lionel-messi/leistungsdaten/spieler/28003/saison/2014/plus/1")
pg %>%
  html_nodes("table") %>%
  extract2(6) %>%
  html_nodes("tbody > tr") %>%
  sapply(function(x) {
    length(xpathSApply(x, "./td[8]/span[@class='kapitaenicon-table icons_sprite']")) == 1
  }) %>% which
## [1] 2 10 11 27
That gets the 6th table, extracts the trs, then looks through them for an 8th td with the proper span/class in it. If the XPath search fails it returns an empty list, so you can use the length to determine which rows have the td with the icon in them and which do not.
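The length check described above can be tried offline. This is a minimal sketch, assuming a current xml2 release and a made-up three-row table in place of the real page; the inline HTML, the column index 2, and the variable names are illustrative assumptions, not part of the original answer.

```r
library(xml2)

# Hypothetical miniature table standing in for the real page:
# the icon span sits in the 2nd cell of rows 1 and 3 only.
doc <- read_html('
<table>
  <tbody>
    <tr><td>a</td><td><span class="kapitaenicon-table icons_sprite"></span>CF</td></tr>
    <tr><td>b</td><td>RW</td></tr>
    <tr><td>c</td><td><span class="kapitaenicon-table icons_sprite"></span>CF</td></tr>
  </tbody>
</table>')

rows <- xml_find_all(doc, "//table/tbody/tr")

# For each row, search the 2nd cell for the icon span.
# An empty result (length 0) means the row has no icon.
has_icon <- vapply(rows, function(tr) {
  hits <- xml_find_all(tr, "./td[2]/span[contains(@class, 'kapitaenicon-table')]")
  length(hits) > 0
}, logical(1))

icon_rows <- which(has_icon)
```

Because the test runs once per tr, the resulting indices are row numbers even when some rows have fewer cells than others.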
This variant:
pg %>%
  html_nodes(xpath="//table[6]/tbody/tr/td[8]") %>%
  xmlSApply(xpathApply, "boolean(./span[@class='kapitaenicon-table icons_sprite'])") %>%
  which
also works and is a bit tighter (and faster). It uses the XPath boolean operation to test for existence. This is handier if you have no other operations to perform on the node(s).
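The same existence test can be sketched with xml2 alone. The example below assumes a current xml2 release, where xml_find_lgl evaluates an XPath expression that returns a boolean; the two-row table and the column index 2 are made up for illustration.

```r
library(xml2)

# Hypothetical two-row table; only the first row carries the icon.
doc <- read_html('
<table>
  <tbody>
    <tr><td>a</td><td><span class="kapitaenicon-table icons_sprite"></span>CF</td></tr>
    <tr><td>b</td><td>RW</td></tr>
  </tbody>
</table>')

# Select the 2nd cell of every row, then ask XPath whether each
# cell contains a matching span. boolean() yields TRUE or FALSE.
cells <- xml_find_all(doc, "//table/tbody/tr/td[2]")
has_icon <- xml_find_lgl(cells, "boolean(./span[contains(@class, 'kapitaenicon-table')])")

icon_cells <- which(has_icon)
```

Note that the indices here count selected cells, so they match row numbers only if every row actually has a cell in that position.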
This is an xml2 version, though I have to believe there has to be a better way to do this in xml2:
library(xml2)
library(magrittr)
pg2 <- read_html("http://www.transfermarkt.nl/lionel-messi/leistungsdaten/spieler/28003/saison/2014/plus/1")
pg2 %>%
  xml_find_all("//table[6]/tbody/tr/td[8]") %>%
  as_list %>%
  sapply(function(x) {
    inherits(try(xml_find_one(x, "./span"), silent=TRUE), "xml_node")
  }) %>% which
UPDATE
For version 0.1.0.9000 of xml2 I had to do the following:
pg2 %>%
  xml_find_all("//table") %>%
  as_list %>%
  extract2(6) %>%
  xml_find_all("./tbody/tr/td[8]") %>%
  as_list %>%
  sapply(function(x) {
    inherits(try(xml_find_one(x, "./span"), silent=TRUE), "xml_node")
  }) %>% which
That should not be the case and I've filed a bug report.
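One thing worth checking when a positional XPath behaves differently from picking the nth element of a node list: in XPath, //table[n] selects every table that is the nth table child of its own parent, while (//table)[n] selects the nth table in document order. Whether this explains the difference seen above is only a guess, not something the answer confirms. A small sketch, assuming a current xml2 release and made-up markup with hypothetical id attributes:

```r
library(xml2)

# Hypothetical markup: three tables spread over two parent divs.
doc <- read_html('
<div><table id="t1"></table></div>
<div><table id="t2"></table><table id="t3"></table></div>')

# //table[2]: tables that are the 2nd table child of their own parent.
per_parent <- xml_attr(xml_find_all(doc, "//table[2]"), "id")

# (//table)[2]: the 2nd table in document order.
in_document <- xml_attr(xml_find_all(doc, "(//table)[2]"), "id")
```

Here per_parent picks the second table inside the second div, while in_document picks the second table overall, so the two expressions agree only when all tables share one parent.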
Session info -------------------------------------------------------------------------
 setting  value
 version  R version 3.2.0 (2015-04-16)
 system   x86_64, darwin13.4.0
 ui       RStudio (0.99.441)
 language (EN)
 collate  en_US.UTF-8
 tz       America/New_York
Packages -----------------------------------------------------------------------------
 package    * version date       source
 curl       * 0.5     2015-02-01 CRAN (R 3.2.0)
 devtools   * 1.7.0   2015-01-17 CRAN (R 3.2.0)
 magrittr     1.5     2014-11-22 CRAN (R 3.2.0)
 Rcpp       * 0.11.5  2015-03-06 CRAN (R 3.2.0)
 rstudioapi * 0.3.1   2015-04-07 CRAN (R 3.2.0)
 xml2         0.1.0   2015-04-20 CRAN (R 3.2.0)
Source: https://stackoverflow.com/questions/30556130/find-cell-in-html-table-containing-a-specific-icon