Question
I am trying to extract the table on the page below using rvest and html_table(). However, the first row of the table contains a long block of text, and it apparently causes html_table() to fail. Here is my code:
# Libraries
library(rvest)
library(XML)
url<-"http://www.svs.cl/institucional/mercados/consulta.php?mercado=V&Estado=VI&entidad=RVEMI" #page
url<-read_html(url)
table<-html_nodes(url,"table") #read the table nodes
table<-html_table(table,fill=TRUE) #convert the nodes to a table
And the error is:
Error in if (length(p) > 1 & maxp * n != sum(unlist(nrows)) & maxp * n != :
  missing value where TRUE/FALSE needed
In addition: Warning message:
In lapply(ncols, as.integer) : NAs introduced by coercion
Maybe it could be done with html_text, but I need the result in table format.
Any help is appreciated
Answer 1:
It's not the size of the table but the extremely gnarly nodes in the first two rows.
So, just edit out the problem nodes.
xml2 now supports a much wider array of libxml2 operations:
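library(rvest)
library(xml2) # provides xml_remove(); recent rvest versions no longer attach xml2
library(tidyverse)
pg <- read_html("http://www.svs.cl/institucional/mercados/consulta.php?mercado=V&Estado=VI&entidad=RVEMI")
# Remove the first row twice: after the first removal, the old second row becomes tr[1]
xml_remove(html_nodes(pg, xpath=".//table/tr[1]"))
xml_remove(html_nodes(pg, xpath=".//table/tr[1]"))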
html_nodes(pg, xpath=".//table") %>%
html_table() %>%
.[[1]] %>%
as_tibble()
## # A tibble: 368 × 3
## X1 X2 X3
## <chr> <chr> <chr>
## 1 76675290-K AD RETAIL S.A. VI
## 2 98000000-1 ADMINISTRADORA DE FONDOS DE PENSIONES CAPITAL S.A. VI
## 3 98000100-8 ADMINISTRADORA DE FONDOS DE PENSIONES HABITAT S.A. VI
## 4 76240079-0 ADMINISTRADORA DE FONDOS DE PENSIONES CUPRUM S.A. VI
## 5 76762250-3 ADMINISTRADORA DE FONDOS DE PENSIONES MODELO S.A. VI
## 6 98001200-K ADMINISTRADORA DE FONDOS DE PENSIONES PLANVITAL S.A. VI
## 7 76265736-8 ADMINISTRADORA DE FONDOS DE PENSIONES PROVIDA S.A. VI
## 8 94272000-9 AES GENER S.A. VI
## 9 96566940-K AGENCIAS UNIVERSALES S.A. VI
## 10 91253000-0 AGRICOLA NACIONAL S.A.C. E I. VI
## # ... with 358 more rows
Note you can do:
xml_remove(html_nodes(pg, xpath=".//table/tr[position() >= 1 and position() <=2]"))
instead of the two remove ops but it's almost as verbose and there's no real performance gain here.
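As an offline illustration of the single-expression variant, here is a minimal sketch that runs against an inline HTML string rather than the live page. The markup is a hypothetical stand-in (two awkward leading rows followed by data rows) and is only assumed to resemble the real table.

```r
library(rvest)
library(xml2)

# Hypothetical stand-in for the real page: two problem rows, then data rows.
html <- '<table>
  <tr><td colspan="3">Long descriptive text spanning the table</td></tr>
  <tr><td colspan="3">Second header-like row</td></tr>
  <tr><td>76675290-K</td><td>AD RETAIL S.A.</td><td>VI</td></tr>
  <tr><td>94272000-9</td><td>AES GENER S.A.</td><td>VI</td></tr>
</table>'

pg <- read_html(html)

# Select the first two rows with one XPath expression and remove them in place.
xml_remove(xml_find_all(pg, ".//table/tr[position() <= 2]"))

# With the problem rows gone, html_table() parses the remaining rows normally.
demo <- html_table(html_nodes(pg, "table"))[[1]]
print(demo)
```

Because xml_remove() modifies the parsed document in place, pg has to be re-read if the original rows are needed again.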
Answer 2:
Here is a messy solution, but it should work in this case. It looks like the first 2 rows of the HTML table are headers, and that might be causing the problem. I had to resort to a brute-force method: reading all of the cells and building my own table.
library(rvest)
#library(XML) #not needed
url<-"http://www.svs.cl/institucional/mercados/consulta.php?mercado=V&Estado=VI&entidad=RVEMI" #page
url<-read_html(url)
table<-html_nodes(url,"table") #read the table nodes
#find the rows and remove the first one
rows<-(html_nodes(table, "tr")[-1])
#now find each item in each row
values<-html_text(html_nodes(rows, "td"))
#clean up values by removing whitespace, \t, \r, \n
values<-trimws(gsub("(\\t|\\n|\\r)", "", values))
#convert into a data frame
finaltable<-as.data.frame(matrix(values, ncol=3, byrow=TRUE))
Hope this helps
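A variation on the brute-force idea, sketched under assumptions: instead of dropping rows by position, keep only the rows that contain exactly three cells, so header-like rows with a different cell count are skipped automatically. The inline HTML and the column names (rut, name, status) are hypothetical and only stand in for the real page.

```r
library(rvest)

# Hypothetical stand-in for the real page: two header-like rows, then data rows.
html <- '<table>
  <tr><td colspan="3">Long descriptive text spanning the table</td></tr>
  <tr><td colspan="3">Second header-like row</td></tr>
  <tr><td>76675290-K</td><td> AD RETAIL S.A. </td><td>VI</td></tr>
  <tr><td>94272000-9</td><td>AES GENER S.A.</td><td>VI</td></tr>
</table>'

pg <- read_html(html)
rows <- html_nodes(pg, "table tr")

# Split each row into the trimmed text of its cells.
cells <- lapply(rows, function(r) trimws(html_text(html_nodes(r, "td"))))

# Keep only rows with exactly three cells (the assumed shape of a data row).
cells <- cells[lengths(cells) == 3]

# Stack the rows into a character matrix, then convert to a data frame.
finaltable <- as.data.frame(do.call(rbind, cells), stringsAsFactors = FALSE)
names(finaltable) <- c("rut", "name", "status") # hypothetical column names
print(finaltable)
```

Filtering by cell count avoids hard-coding how many leading rows to drop, at the cost of silently discarding any data row that happens to be malformed.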
Source: https://stackoverflow.com/questions/42989014/html-table-dont-work-with-long-row