Scraping multiple table out of webpage in R

╄→гoц情女王★ 提交于 2019-12-08 04:43:00

问题


I am trying to pull mutual funds data into R, My way of code works for single table but when there are multiple tables in a webpage, it doesn't work.

Link - https://in.finance.yahoo.com/q/pm?s=115748.BO

My Code

url <- "https://in.finance.yahoo.com/q/pm?s=115748.BO"
library(XML)
perftable <- readHTMLTable(url, header = T, which = 1, stringsAsFactors = F)

but i am getting an error message.

Error in (function (classes, fdef, mtable) : unable to find an inherited method for function ‘readHTMLTable’ for signature ‘"NULL"’ In addition: Warning message: XML content does not seem to be XML: 'https://in.finance.yahoo.com/q/pm?s=115748.BO'

My Question is

  1. How to pull a specific table out of this webpage?
  2. How to pull all tables out of this webpage?
  3. when there are multiple links, what would be the easy way to pull specific table from each those webpages

Ahttps://in.finance.yahoo.com/q/pm?s=115748.BO

Ahttps://in.finance.yahoo.com/q/pm?s=115749.BO

Ahttps://in.finance.yahoo.com/q/pm?s=115750.BO

Remove "A" From the link, while using the link.


回答1:


Base R is not able to access https. You can use a package like RCurl. The headers on the tables are actually seperate tables. The page is actually composed of 30+ tables. The data you want is most like given by table with a class = yfnc_datamodoutline1 :

url <- "https://in.finance.yahoo.com/q/pm?s=115748.BO"
library(XML)
library(RCurl)
appData <- getURL(url, ssl.verifypeer = FALSE)
doc <- htmlParse(appData)
appData <- doc['//table[@class="yfnc_datamodoutline1"]']
perftable <- readHTMLTable(appData[[1]], stringsAsFactors = F)
> perftable
V1      V2
1            Morningstar Return Rating:    2.00
2                  Year-to-Date Return:   2.77%
3                5-Year Average Return:   9.76%
4                   Number of Years Up:       4
5                 Number of Years Down:       1
6  Best 1 Yr Total Return (2014-12-31):  37.05%
7 Worst 1 Yr Total Return (2011-12-31): -27.26%
8         Best 3-Yr Total Return (N/A):  23.11%
9        Worst 3-Yr Total Return (N/A):  -0.33%



回答2:


Here's an rvest version with an added function to extract a particular table from each fund page:

library(rvest)
library(dplyr)

pages <- c("https://in.finance.yahoo.com/q/pm?s=115748.BO", 
           "https://in.finance.yahoo.com/q/pm?s=115749.BO",
           "https://in.finance.yahoo.com/q/pm?s=115750.BO")


extract_tab <- function(sources, tab_idx) {

  data <- lapply(sources, function(x) {

    pg <- html(x)
    pg %>% html_nodes(xpath="//table[@class='yfnc_datamodoutline1']//table") -> tabs
    html_table(tabs[[tab_idx]])

  })

  names(data) <- gsub("pm\\?s=", "", basename(sources))

  data

}

extract_tab(pages, 1)

## $`115748.BO`
##                                      X1      X2
## 1            Morningstar Return Rating:    2.00
## 2                  Year-to-Date Return:   2.77%
## 3                5-Year Average Return:   9.76%
## 4                   Number of Years Up:       4
## 5                 Number of Years Down:       1
## 6  Best 1 Yr Total Return (2014-12-31):  37.05%
## 7 Worst 1 Yr Total Return (2011-12-31): -27.26%
## 8         Best 3-Yr Total Return (N/A):  23.11%
## 9        Worst 3-Yr Total Return (N/A):  -0.33%
## 
## $`115749.BO`
##                                      X1      X2
## 1            Morningstar Return Rating:    2.00
## 2                  Year-to-Date Return:   2.77%
## 3                5-Year Average Return:   9.77%
## 4                   Number of Years Up:       4
## 5                 Number of Years Down:       1
## 6  Best 1 Yr Total Return (2014-12-31):  37.05%
## 7 Worst 1 Yr Total Return (2011-12-31): -27.22%
## 8         Best 3-Yr Total Return (N/A):  23.11%
## 9        Worst 3-Yr Total Return (N/A):  -0.30%
## 
## $`115750.BO`
##                               X1    X2
## 1     Morningstar Return Rating:      
## 2           Year-to-Date Return: 1.95%
## 3         5-Year Average Return: 8.92%
## 4            Number of Years Up:      
## 5          Number of Years Down:      
## 6     Best 1 Yr Total Return ():   N/A
## 7    Worst 1 Yr Total Return ():   N/A
## 8  Best 3-Yr Total Return (N/A):   N/A
## 9 Worst 3-Yr Total Return (N/A):   N/A


来源:https://stackoverflow.com/questions/29935512/scraping-multiple-table-out-of-webpage-in-r

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!