Scrape values from HTML select/option tags in R

Question

I'm trying (fairly unsuccessfully) to scrape some data from a website (www.majidata.go.ke) using R. I've managed to download the HTML and parse it, but I'm now a little unsure how to extract the bits I actually need!

Using the XML library (with httr for the request), I scrape my data using this code:

library(httr)  # provides GET() and content()
library(XML)   # provides htmlTreeParse()
majidata_get <- GET("http://www.majidata.go.ke/town.php?MID=MTE=&SMID=MTM=")
majidata_html <- htmlTreeParse(content(majidata_get, as="text"))

This leaves me with a (Large) XMLDocumentContent object. There is a drop-down list on the webpage and I want to scrape the values from it (which relate to the names and ID numbers of different towns). The bits I want to extract are the number in each <option value="XXX"> and the name following it in capital letters.

<div class="regiondata">
       <div id="town_data">
        <select id="town" name="town" onchange="town_data(this.value);">
         <option value="0" selected="selected">[SELECT TOWN]</option>
         <option value="611">AHERO</option>
         <option value="635">AKALA</option>
         <option value="625">AWASI</option>
         <option value="628">AWENDO</option>
         <option value="749">BAHATI</option>
         <option value="327">BANGALE</option>

Ideally, I'd like to have these in a data.frame where the first column is the number and second column is the name e.g.

ID       Name
611      AHERO
635      AKALA
625      AWASI

etc.

I'm not really sure where to go from here. I had thought to use regex and match the pattern within the text, though I've read on a number of forums that this is a bad idea and that it's better/more efficient to use XPath. I'm not really sure where to start with this, though, other than thinking I need to use xpathApply somehow.

Answer 1:

The very new rvest package makes quick work of this and lets you use sane CSS selectors, too.

UPDATED Incorporates the second request (see comments below)

library(rvest)
library(dplyr)

# gets data from the second popup
# returns a data frame of town_id, town_name, area_id, area_name
addArea <- function(town_id, town_name) {

  # make the AJAX URL and grab the data
  url <- sprintf("http://www.majidata.go.ke/ajax-list-area.php?reg=towns&type=projects&id=%s",
                 town_id)
  subunits <- read_html(url)

  # reformat into a data frame with the town data
  data.frame(town_id=town_id,
             town_name=town_name,
             area_id=subunits %>% html_nodes("option") %>% html_attr("value"),
             area_name=subunits %>% html_nodes("option") %>% html_text(),
             stringsAsFactors=FALSE)[-1,]

}

# get data from the first popup and put it into a data frame
majidata <- read_html("http://www.majidata.go.ke/town.php?MID=MTE=&SMID=MTM=")
maji <- data.frame(town_id=majidata %>% html_nodes("#town option") %>% html_attr("value"),
                   town_name=majidata %>% html_nodes("#town option") %>% html_text(),
                   stringsAsFactors=FALSE)[-1,]

# pass in the name and id to our addArea function and make the result into
# a data frame with all the data (town and area)
combined <- do.call("rbind.data.frame",
                    mapply(addArea, maji$town_id,  maji$town_name,
                           SIMPLIFY=FALSE, USE.NAMES=FALSE))

# row names aren't super-important, but let's keep them tidy
rownames(combined) <- NULL

str(combined)

## 'data.frame':    1964 obs. of  4 variables:
##  $ town_id  : chr  "611" "635" "625" "628" ...
##  $ town_name: chr  "AHERO" "AKALA" "AWASI" "AWENDO" ...
##  $ area_id  : chr  "60603030101" "60107050201" "60603020101" "61103040101" ...
##  $ area_name: chr  "AHERO" "AKALA" "AWASI" "ANINDO" ...


head(combined)

##   town_id town_name     area_id area_name
## 1     611     AHERO 60603030101     AHERO
## 2     635     AKALA 60107050201     AKALA
## 3     625     AWASI 60603020101     AWASI
## 4     628    AWENDO 61103040101    ANINDO
## 5     628    AWENDO 61103050401      SARE
## 6     749    BAHATI 73101010101    BAHATI
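
As a minimal, offline sketch of the first step above (the town drop-down only), the same selectors can be tried against the HTML fragment quoted in the question rather than the live site. It assumes a version of rvest that provides read_html(); the closing </select> and </div> tags and the names snippet, opts, and towns are added here purely for illustration. The placeholder row is filtered by its value of "0" instead of being dropped by position with [-1,].

```r
library(rvest)

# Sample markup copied from the question; the closing tags are an addition
# so that the fragment is complete.
snippet <- '
<div id="town_data">
  <select id="town" name="town">
    <option value="0" selected="selected">[SELECT TOWN]</option>
    <option value="611">AHERO</option>
    <option value="635">AKALA</option>
    <option value="625">AWASI</option>
  </select>
</div>'

# read_html() accepts a literal HTML string as well as a URL
page <- read_html(snippet)

# Select the <option> nodes once and reuse them for both columns
opts <- html_nodes(page, "#town option")

towns <- data.frame(town_id   = html_attr(opts, "value"),
                    town_name = html_text(opts),
                    stringsAsFactors = FALSE)

# Drop the "[SELECT TOWN]" placeholder by value rather than by position
towns <- towns[towns$town_id != "0", ]
rownames(towns) <- NULL

towns

##   town_id town_name
## 1     611     AHERO
## 2     635     AKALA
## 3     625     AWASI
```

Selecting the nodes once keeps the two columns aligned, since both are derived from the same node set.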

Answer 2:

Using XPath expressions with HTML is almost always a better choice than regex. Given this data, you can extract what you're after with:

options <- getNodeSet(xmlRoot(majidata_html), "//select[@id='town']/option")

ids <- sapply(options, xmlGetAttr, "value")
names <- sapply(options, xmlValue)

data.frame(ID=ids, Name=names)

which returns

   ID          Name
1   0 [SELECT TOWN]
2 611         AHERO
3 635         AKALA
4 625         AWASI
5 628        AWENDO
6 749        BAHATI
...
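
The result above still contains the "[SELECT TOWN]" placeholder in its first row. One possible variation, sketched here against the HTML fragment from the question rather than the live page, is to exclude it inside the XPath expression with an attribute predicate. The closing tags in the sample and the names snippet, doc, xp, and towns are assumptions made for this example; htmlParse() with asText = TRUE is used so that the string is parsed into an internal document that XPath can query directly.

```r
library(XML)

# Sample markup copied from the question; the closing tags are an addition
# so that the fragment is complete.
snippet <- '
<div id="town_data">
  <select id="town" name="town">
    <option value="0" selected="selected">[SELECT TOWN]</option>
    <option value="611">AHERO</option>
    <option value="635">AKALA</option>
    <option value="625">AWASI</option>
  </select>
</div>'

# Parse the string (not a file or URL) into an internal document
doc <- htmlParse(snippet, asText = TRUE)

# The predicate [@value != '0'] skips the "[SELECT TOWN]" placeholder
xp <- "//select[@id='town']/option[@value != '0']"

towns <- data.frame(ID   = xpathSApply(doc, xp, xmlGetAttr, "value"),
                    Name = xpathSApply(doc, xp, xmlValue),
                    stringsAsFactors = FALSE)

towns
```

With the sample fragment this leaves three rows (611 AHERO, 635 AKALA, 625 AWASI), matching the layout requested in the question.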

Source: https://stackoverflow.com/questions/25965785/scrape-values-from-html-select-option-tags-in-r

Tags

html

web-scraping

rvest