Scraping multiple tables out of a webpage in R

Question

I am trying to pull mutual fund data into R. My code works for a single table, but when a webpage contains multiple tables, it doesn't work.

Link - https://in.finance.yahoo.com/q/pm?s=115748.BO

My Code

url <- "https://in.finance.yahoo.com/q/pm?s=115748.BO"
library(XML)
perftable <- readHTMLTable(url, header = T, which = 1, stringsAsFactors = F)

But I am getting an error message:

Error in (function (classes, fdef, mtable) :
  unable to find an inherited method for function ‘readHTMLTable’ for signature ‘"NULL"’
In addition: Warning message:
XML content does not seem to be XML: 'https://in.finance.yahoo.com/q/pm?s=115748.BO'

My questions are:

How can I pull a specific table out of this webpage?
How can I pull all tables out of this webpage?
When there are multiple links, what would be an easy way to pull a specific table from each of those webpages?

Ahttps://in.finance.yahoo.com/q/pm?s=115748.BO

Ahttps://in.finance.yahoo.com/q/pm?s=115749.BO

Ahttps://in.finance.yahoo.com/q/pm?s=115750.BO

Remove the leading "A" from each link before using it.

Answer 1:

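Base R is not able to access https, so readHTMLTable cannot download the page itself, which is why it ends up with NULL instead of a parsed document. You can fetch the page first with a package like RCurl. The headers on the tables are actually separate tables, and the page is composed of 30+ tables in total. The data you want is most likely in the table with class = yfnc_datamodoutline1: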

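library(XML)
library(RCurl)

url <- "https://in.finance.yahoo.com/q/pm?s=115748.BO"
# Download the page over https; ssl.verifypeer = FALSE skips certificate verification
pageText <- getURL(url, ssl.verifypeer = FALSE)
# Parse the downloaded text (not the URL) into an HTML document
doc <- htmlParse(pageText, asText = TRUE)
# XPath query: every table whose class is yfnc_datamodoutline1
tableNodes <- doc['//table[@class="yfnc_datamodoutline1"]']
# Convert the first matching table to a data frame
perftable <- readHTMLTable(tableNodes[[1]], stringsAsFactors = FALSE)
> perftable
                                     V1      V2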
1            Morningstar Return Rating:    2.00
2                  Year-to-Date Return:   2.77%
3                5-Year Average Return:   9.76%
4                   Number of Years Up:       4
5                 Number of Years Down:       1
6  Best 1 Yr Total Return (2014-12-31):  37.05%
7 Worst 1 Yr Total Return (2011-12-31): -27.26%
8         Best 3-Yr Total Return (N/A):  23.11%
9        Worst 3-Yr Total Return (N/A):  -0.33%
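
The same parsed document also covers the first two questions in general: calling readHTMLTable() on the whole document returns a list with one data frame per table, and an XPath query narrows that down to a specific table. Here is a minimal sketch, assuming the XML package is installed. The inline HTML string and its values are made up for illustration (it is not the real Yahoo markup), so the example runs without network access.

```r
library(XML)

# Hypothetical stand-in for the downloaded page (not the real Yahoo markup).
html_text <- '<html><body>
<table class="other">
  <tr><td>Fund</td><td>Example A</td></tr>
  <tr><td>Category</td><td>Example B</td></tr>
</table>
<table class="yfnc_datamodoutline1">
  <tr><td>Year-to-Date Return:</td><td>2.77%</td></tr>
  <tr><td>Number of Years Up:</td><td>4</td></tr>
</table>
</body></html>'

doc <- htmlParse(html_text, asText = TRUE)

# All tables: a list with one data frame per <table> node.
all_tables <- readHTMLTable(doc, header = FALSE, stringsAsFactors = FALSE)
length(all_tables)

# One specific table, selected by its class attribute.
nodes <- getNodeSet(doc, '//table[@class="yfnc_datamodoutline1"]')
perftable <- readHTMLTable(nodes[[1]], header = FALSE, stringsAsFactors = FALSE)
perftable
```

On a real page, inspecting length(all_tables) and the first few rows of each element is usually the quickest way to find the index or class of the table you are after.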

Answer 2:

Here's an rvest version with an added function to extract a particular table from each fund page:

library(rvest)
library(dplyr)

pages <- c("https://in.finance.yahoo.com/q/pm?s=115748.BO", 
           "https://in.finance.yahoo.com/q/pm?s=115749.BO",
           "https://in.finance.yahoo.com/q/pm?s=115750.BO")


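extract_tab <- function(sources, tab_idx) {

  data <- lapply(sources, function(x) {

    # download and parse the page
    pg <- read_html(x)
    # the inner tables nested inside each yfnc_datamodoutline1 wrapper table
    tabs <- pg %>% html_nodes(xpath="//table[@class='yfnc_datamodoutline1']//table")
    html_table(tabs[[tab_idx]])

  })

  # name each result after its fund code, e.g. "115748.BO"
  names(data) <- gsub("pm\\?s=", "", basename(sources))

  data

}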

extract_tab(pages, 1)

## $`115748.BO`
##                                      X1      X2
## 1            Morningstar Return Rating:    2.00
## 2                  Year-to-Date Return:   2.77%
## 3                5-Year Average Return:   9.76%
## 4                   Number of Years Up:       4
## 5                 Number of Years Down:       1
## 6  Best 1 Yr Total Return (2014-12-31):  37.05%
## 7 Worst 1 Yr Total Return (2011-12-31): -27.26%
## 8         Best 3-Yr Total Return (N/A):  23.11%
## 9        Worst 3-Yr Total Return (N/A):  -0.33%
## 
## $`115749.BO`
##                                      X1      X2
## 1            Morningstar Return Rating:    2.00
## 2                  Year-to-Date Return:   2.77%
## 3                5-Year Average Return:   9.77%
## 4                   Number of Years Up:       4
## 5                 Number of Years Down:       1
## 6  Best 1 Yr Total Return (2014-12-31):  37.05%
## 7 Worst 1 Yr Total Return (2011-12-31): -27.22%
## 8         Best 3-Yr Total Return (N/A):  23.11%
## 9        Worst 3-Yr Total Return (N/A):  -0.30%
## 
## $`115750.BO`
##                               X1    X2
## 1     Morningstar Return Rating:      
## 2           Year-to-Date Return: 1.95%
## 3         5-Year Average Return: 8.92%
## 4            Number of Years Up:      
## 5          Number of Years Down:      
## 6     Best 1 Yr Total Return ():   N/A
## 7    Worst 1 Yr Total Return ():   N/A
## 8  Best 3-Yr Total Return (N/A):   N/A
## 9 Worst 3-Yr Total Return (N/A):   N/A
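
One caveat when looping over many fund pages: if a single page fails to load or has no matching table, indexing with tabs[[tab_idx]] raises an error and the whole lapply() stops. Below is a minimal sketch of a more forgiving variant, assuming it is acceptable to return NULL for such pages and drop them afterwards. The helper name safe_extract is hypothetical, and the two inline HTML strings are made-up stand-ins for real pages so that the example runs offline.

```r
library(rvest)

# Hypothetical helper: returns the tab_idx-th matching table, or NULL on failure.
safe_extract <- function(source, tab_idx,
                         xpath = "//table[@class='yfnc_datamodoutline1']//table") {
  tryCatch({
    pg <- read_html(source)              # accepts a URL, a file path, or literal HTML
    tabs <- html_nodes(pg, xpath = xpath)
    if (length(tabs) < tab_idx) NULL else html_table(tabs[[tab_idx]])
  }, error = function(e) NULL)           # network or parse errors also yield NULL
}

# Made-up pages: one with the nested table, one without any table.
good <- "<html><body><table class='yfnc_datamodoutline1'><tr><td>
  <table><tr><td>Year-to-Date Return:</td><td>2.77%</td></tr></table>
</td></tr></table></body></html>"
bad <- "<html><body><p>No tables here</p></body></html>"

results <- lapply(list(fund_a = good, fund_b = bad), safe_extract, tab_idx = 1)

# Drop the pages that returned NULL before further processing.
results <- Filter(Negate(is.null), results)
names(results)
```

With real URLs, the same helper can be passed to lapply() over the pages vector in place of the inline strings.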

Source: https://stackoverflow.com/questions/29935512/scraping-multiple-table-out-of-webpage-in-r

Tags

data.table

screen-scraping