Question
I am trying to pull mutual fund data into R. My code works when a page has a single table, but it fails when the webpage contains multiple tables.
Link - https://in.finance.yahoo.com/q/pm?s=115748.BO
My Code
url <- "https://in.finance.yahoo.com/q/pm?s=115748.BO"
library(XML)
perftable <- readHTMLTable(url, header = T, which = 1, stringsAsFactors = F)
But I am getting an error message:
Error in (function (classes, fdef, mtable) :
  unable to find an inherited method for function ‘readHTMLTable’ for signature ‘"NULL"’
In addition: Warning message:
XML content does not seem to be XML: 'https://in.finance.yahoo.com/q/pm?s=115748.BO'
My questions are:
- How do I pull a specific table out of this webpage?
- How do I pull all tables out of this webpage?
- When there are multiple links, what is an easy way to pull a specific table from each of those webpages?
https://in.finance.yahoo.com/q/pm?s=115748.BO
https://in.finance.yahoo.com/q/pm?s=115749.BO
https://in.finance.yahoo.com/q/pm?s=115750.BO
Answer 1:
readHTMLTable from the XML package cannot fetch https URLs directly, which is why the call on the raw URL fails. You can download the page first with a package like RCurl and then parse the text. The headers on the tables are actually separate tables; the page is composed of 30+ tables. The data you want is most likely in the table with class = yfnc_datamodoutline1:
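library(XML)
library(RCurl)

url <- "https://in.finance.yahoo.com/q/pm?s=115748.BO"

# Download the page over https; ssl.verifypeer = FALSE skips certificate checks.
appData <- getURL(url, ssl.verifypeer = FALSE)
doc <- htmlParse(appData)

# Select the tables with the target class via XPath, then read the first match.
appData <- doc['//table[@class="yfnc_datamodoutline1"]']
perftable <- readHTMLTable(appData[[1]], stringsAsFactors = FALSE)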
> perftable
V1 V2
1 Morningstar Return Rating: 2.00
2 Year-to-Date Return: 2.77%
3 5-Year Average Return: 9.76%
4 Number of Years Up: 4
5 Number of Years Down: 1
6 Best 1 Yr Total Return (2014-12-31): 37.05%
7 Worst 1 Yr Total Return (2011-12-31): -27.26%
8 Best 3-Yr Total Return (N/A): 23.11%
9 Worst 3-Yr Total Return (N/A): -0.33%
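The question also asks how to pull all tables versus one specific table. A minimal sketch of the three options with the XML package follows. It parses a small made-up HTML string (`html_text` is a hypothetical stand-in, not the real Yahoo page) so that it runs without network access; with the real page you would pass the text returned by getURL instead.

```r
library(XML)

# Hypothetical stand-in for a downloaded page: two tables, one with the target class.
html_text <- '<html><body>
<table class="other">
  <tr><td>Fund</td><td>115748.BO</td></tr>
  <tr><td>Category</td><td>Equity</td></tr>
</table>
<table class="yfnc_datamodoutline1">
  <tr><td>Year-to-Date Return:</td><td>2.77%</td></tr>
  <tr><td>5-Year Average Return:</td><td>9.76%</td></tr>
</table>
</body></html>'

doc <- htmlParse(html_text, asText = TRUE)

# All tables: on a parsed document, readHTMLTable() returns a list with one
# element per <table> node.
all_tables <- readHTMLTable(doc, header = FALSE, stringsAsFactors = FALSE)

# One table by position: which = 2 picks the second <table> in document order.
second <- readHTMLTable(doc, header = FALSE, which = 2, stringsAsFactors = FALSE)

# One table by class: select the node with XPath, then read just that node.
nodes <- getNodeSet(doc, '//table[@class="yfnc_datamodoutline1"]')
by_class <- readHTMLTable(nodes[[1]], header = FALSE, stringsAsFactors = FALSE)
```

Selecting by class tends to be more robust than selecting by position when a page has dozens of layout tables, since the index shifts whenever the page layout changes.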
Answer 2:
Here's an rvest version with an added function to extract a particular table from each fund page:
library(rvest)
library(dplyr)

pages <- c("https://in.finance.yahoo.com/q/pm?s=115748.BO",
           "https://in.finance.yahoo.com/q/pm?s=115749.BO",
           "https://in.finance.yahoo.com/q/pm?s=115750.BO")

# Return the tab_idx-th inner table from each page, as a list named by fund code.
extract_tab <- function(sources, tab_idx) {
  data <- lapply(sources, function(x) {
    pg <- read_html(x)  # html() in older rvest releases, since deprecated
    tabs <- pg %>%
      html_nodes(xpath = "//table[@class='yfnc_datamodoutline1']//table")
    html_table(tabs[[tab_idx]])
  })
  # "pm?s=115748.BO" becomes "115748.BO"
  names(data) <- gsub("pm\\?s=", "", basename(sources))
  data
}

extract_tab(pages, 1)
## $`115748.BO`
## X1 X2
## 1 Morningstar Return Rating: 2.00
## 2 Year-to-Date Return: 2.77%
## 3 5-Year Average Return: 9.76%
## 4 Number of Years Up: 4
## 5 Number of Years Down: 1
## 6 Best 1 Yr Total Return (2014-12-31): 37.05%
## 7 Worst 1 Yr Total Return (2011-12-31): -27.26%
## 8 Best 3-Yr Total Return (N/A): 23.11%
## 9 Worst 3-Yr Total Return (N/A): -0.33%
##
## $`115749.BO`
## X1 X2
## 1 Morningstar Return Rating: 2.00
## 2 Year-to-Date Return: 2.77%
## 3 5-Year Average Return: 9.77%
## 4 Number of Years Up: 4
## 5 Number of Years Down: 1
## 6 Best 1 Yr Total Return (2014-12-31): 37.05%
## 7 Worst 1 Yr Total Return (2011-12-31): -27.22%
## 8 Best 3-Yr Total Return (N/A): 23.11%
## 9 Worst 3-Yr Total Return (N/A): -0.30%
##
## $`115750.BO`
## X1 X2
## 1 Morningstar Return Rating:
## 2 Year-to-Date Return: 1.95%
## 3 5-Year Average Return: 8.92%
## 4 Number of Years Up:
## 5 Number of Years Down:
## 6 Best 1 Yr Total Return (): N/A
## 7 Worst 1 Yr Total Return (): N/A
## 8 Best 3-Yr Total Return (N/A): N/A
## 9 Worst 3-Yr Total Return (N/A): N/A
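When looping over many fund pages, one page that fails to parse or lacks the expected table would stop the whole run. The sketch below is one possible variation, not part of the original answer: it skips such pages and stacks the rest into a single data frame. The names `pages_html` and `extract_tab_safe` are hypothetical, and the inline HTML strings stand in for downloaded pages so the example runs offline.

```r
library(rvest)
library(dplyr)

# Hypothetical page contents keyed by fund code (stand-ins for downloaded pages).
pages_html <- c(
  "115748.BO" = '<html><body><table class="yfnc_datamodoutline1"><tr><td>
      <table>
        <tr><td>Year-to-Date Return:</td><td>2.77%</td></tr>
        <tr><td>5-Year Average Return:</td><td>9.76%</td></tr>
      </table></td></tr></table></body></html>',
  "115749.BO" = '<html><body><table class="yfnc_datamodoutline1"><tr><td>
      <table>
        <tr><td>Year-to-Date Return:</td><td>2.77%</td></tr>
        <tr><td>5-Year Average Return:</td><td>9.77%</td></tr>
      </table></td></tr></table></body></html>',
  "115750.BO" = '<html><body><p>No data available</p></body></html>'
)

# Like extract_tab(), but returns NULL for a page instead of stopping when the
# page cannot be parsed or has fewer than tab_idx matching tables.
extract_tab_safe <- function(sources, tab_idx) {
  lapply(sources, function(x) {
    tryCatch({
      tabs <- read_html(x) %>%
        html_nodes(xpath = "//table[@class='yfnc_datamodoutline1']//table")
      if (length(tabs) < tab_idx) NULL else as.data.frame(html_table(tabs[[tab_idx]]))
    }, error = function(e) NULL)
  })
}

results <- extract_tab_safe(pages_html, 1)

# Drop the pages that yielded nothing, then stack the rest; the list names
# (fund codes) become a "fund" column.
results <- Filter(Negate(is.null), results)
combined <- bind_rows(results, .id = "fund")
```

With real URLs in place of the HTML strings, the same function should work unchanged, since read_html accepts either a URL or literal HTML text.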
Source: https://stackoverflow.com/questions/29935512/scraping-multiple-table-out-of-webpage-in-r