Question
I'm trying (fairly unsuccessfully) to scrape some data from a website (www.majidata.go.ke) using R. I've managed to scrape the HTML and parse it, but I'm now a little unsure how to extract the bits I actually need!
Using the XML library (with httr for the request), I scrape my data using this code:
library(httr)  # GET(), content()
library(XML)   # htmlTreeParse()
majidata_get <- GET("http://www.majidata.go.ke/town.php?MID=MTE=&SMID=MTM=")
majidata_html <- htmlTreeParse(content(majidata_get, as="text"))
This leaves me with (Large) XMLDocumentContent. There is a drop-down list on the webpage and I want to scrape the values from it (which relate to the names and ID numbers of different towns). The bits I want to extract are the numbers in <option value="XXX"> and the name following it in capital letters.
<div class="regiondata">
<div id="town_data">
<select id="town" name="town" onchange="town_data(this.value);">
<option value="0" selected="selected">[SELECT TOWN]</option>
<option value="611">AHERO</option>
<option value="635">AKALA</option>
<option value="625">AWASI</option>
<option value="628">AWENDO</option>
<option value="749">BAHATI</option>
<option value="327">BANGALE</option>
Ideally, I'd like to have these in a data.frame where the first column is the number and second column is the name e.g.
ID Name
611 AHERO
635 AKALA
625 AWASI
etc.
I'm not really sure where to go from here. I had thought to use regex and match the pattern within the text, though I've read on a number of forums that this is a bad idea and that it's better/more efficient to use XPath. I'm not really sure where to start with this, though, other than thinking I need to use xpathApply somehow.
Answer 1:
The very new rvest package makes quick work of this and lets you use sane CSS selectors, too.
UPDATED: Incorporates the second request (see comments below).
library(rvest)
library(dplyr)
# gets data from the second popup
# returns a data frame of town_id, town_name, area_id, area_name
addArea <- function(town_id, town_name) {
  # make the AJAX URL and grab the data
  url <- sprintf("http://www.majidata.go.ke/ajax-list-area.php?reg=towns&type=projects&id=%s",
                 town_id)
  # read_html() is the current name for what early rvest versions called html()
  subunits <- read_html(url)
  # reformat into a data frame with the town data
  data.frame(town_id=town_id,
             town_name=town_name,
             area_id=subunits %>% html_nodes("option") %>% html_attr("value"),
             area_name=subunits %>% html_nodes("option") %>% html_text(),
             stringsAsFactors=FALSE)[-1,]
}
# get data from the first popup and put it into a data frame
majidata <- read_html("http://www.majidata.go.ke/town.php?MID=MTE=&SMID=MTM=")
maji <- data.frame(town_id=majidata %>% html_nodes("#town option") %>% html_attr("value"),
                   town_name=majidata %>% html_nodes("#town option") %>% html_text(),
                   stringsAsFactors=FALSE)[-1,]
# pass in the name and id to our addArea function and make the result into
# a data frame with all the data (town and area)
combined <- do.call("rbind.data.frame",
                    mapply(addArea, maji$town_id, maji$town_name,
                           SIMPLIFY=FALSE, USE.NAMES=FALSE))
# row names aren't super-important, but let's keep them tidy
rownames(combined) <- NULL
str(combined)
## 'data.frame': 1964 obs. of 4 variables:
## $ town_id : chr "611" "635" "625" "628" ...
## $ town_name: chr "AHERO" "AKALA" "AWASI" "AWENDO" ...
## $ area_id : chr "60603030101" "60107050201" "60603020101" "61103040101" ...
## $ area_name: chr "AHERO" "AKALA" "AWASI" "ANINDO" ...
head(combined)
##   town_id town_name     area_id area_name
## 1     611     AHERO 60603030101     AHERO
## 2     635     AKALA 60107050201     AKALA
## 3     625     AWASI 60603020101     AWASI
## 4     628    AWENDO 61103040101    ANINDO
## 5     628    AWENDO 61103050401      SARE
## 6     749    BAHATI 73101010101    BAHATI
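If the site is unreachable, the selector step on its own can be tried offline. The following is a minimal sketch, assuming a hand-copied fragment of the drop-down markup from the question (the `snippet` string and the `towns` name are illustrative, not part of the answer's code). It filters the placeholder by its value of "0" rather than by position, which is a small variation on the `[-1,]` used above.

```r
library(rvest)

# Hypothetical inline copy of the drop-down markup from the question,
# so the example runs without network access.
snippet <- '
<div id="town_data">
  <select id="town" name="town">
    <option value="0" selected="selected">[SELECT TOWN]</option>
    <option value="611">AHERO</option>
    <option value="635">AKALA</option>
    <option value="625">AWASI</option>
  </select>
</div>'

page <- read_html(snippet)

# "#town option" matches every <option> inside the element with id="town"
opts <- html_nodes(page, "#town option")

towns <- data.frame(
  town_id   = html_attr(opts, "value"),
  town_name = html_text(opts),
  stringsAsFactors = FALSE
)

# drop the "[SELECT TOWN]" placeholder, identified by its value of "0"
towns <- towns[towns$town_id != "0", ]
rownames(towns) <- NULL
towns
```

Filtering on the value is slightly more defensive than dropping the first row, since it does not depend on the placeholder always being listed first.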
Answer 2:
Using xpath expressions with HTML is almost always a better choice than regex. Given this data you can extract what you're after with
options<-getNodeSet(xmlRoot(majidata_html), "//select[@id='town']/option")
ids <- sapply(options, xmlGetAttr, "value")
names <- sapply(options, xmlValue)
data.frame(ID=ids, Name=names)
which returns
   ID          Name
1   0 [SELECT TOWN]
2 611         AHERO
3 635         AKALA
4 625         AWASI
5 628        AWENDO
6 749        BAHATI
...
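The result above still contains the "[SELECT TOWN]" placeholder row. As a rough sketch of one way to leave it out, the XPath expression itself can carry a predicate on the value attribute. The `snippet` string and the `towns_xml` name below are assumptions for illustration, and the fragment is parsed with htmlParse(asText = TRUE) so that it runs offline.

```r
library(XML)

# Hypothetical inline copy of the drop-down markup from the question
snippet <- '
<select id="town" name="town">
  <option value="0" selected="selected">[SELECT TOWN]</option>
  <option value="611">AHERO</option>
  <option value="635">AKALA</option>
  <option value="625">AWASI</option>
</select>'

# parse the string directly instead of fetching a URL
doc <- htmlParse(snippet, asText = TRUE)

# the predicate [@value != '0'] skips the placeholder option
xp <- "//select[@id='town']/option[@value != '0']"

town_ids   <- xpathSApply(doc, xp, xmlGetAttr, "value")
town_names <- xpathSApply(doc, xp, xmlValue)

towns_xml <- data.frame(
  ID   = as.integer(town_ids),
  Name = town_names,
  stringsAsFactors = FALSE
)
towns_xml
```

Converting the IDs with as.integer() is optional; keep them as character if leading zeros could matter.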
Source: https://stackoverflow.com/questions/25965785/scrape-values-from-html-select-option-tags-in-r