I want to parse this HTML and get these elements from it:
a) p tag, with class: "normal_encontrado".
b) div with c
Using the XML package, parse the input with xmlTreeParse and then use xpathSApply to iterate over the div nodes of class product_price. For each such node, the anonymous function gets the values of the div and p subnodes. The resulting matrix m is reworked into a data frame DF, and the columns are cleaned by removing any character that is not a dot or digit and also removing any dot followed by a non-digit. The result is then converted to numeric. Note that no special processing is needed for the missing p case.
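To see the cleaning step in isolation, here is a minimal sketch in base R that applies the same regular expression to a plain character vector. The helper name clean_price is hypothetical and only used for illustration; the sample strings are the price texts from the input.

```r
# Hypothetical helper: strip everything except digits and dots,
# plus any dot followed by a non-digit, then convert to numeric.
clean_price <- function(x) {
  as.numeric(gsub("[^.0-9]|[.]\\D", "", x))
}

prices <- c("S/. 2,799.00", "S/. 2,299.00", NA)
clean_price(prices)
# the NA entry stays NA; the two strings become 2799 and 2299
```

The second alternative in the pattern is what removes the dot in "S/. " while keeping the decimal point in "2,799.00", since only the former is followed by a non-digit.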
# input
Lines <- '
S/. 2,799.00
S/. 2,299.00
S/. 4,999.00
'
# code to read input and produce a data.frame
library(XML)
doc <- xmlTreeParse(Lines, asText = TRUE, useInternalNodes = TRUE)
m <- xpathSApply(doc, "//div[@class = 'product_price']", function(node) {
  list(p = xmlValue(node[["p"]]), div = xmlValue(node[["div"]]))
})
DF <- as.data.frame(t(m), stringsAsFactors = FALSE) # rework into data frame
DF[] <- lapply(DF, function(x) as.numeric(gsub("[^.0-9]|[.]\\D", "", x))) # clean
The result is:
> DF
     p  div
1 2799 2299
2   NA 4999
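Since only the price texts of the input are visible above, the following is a self-contained sketch that runs the same steps on a hypothetical input. The tag layout, the wrapping root element, and the class name precio_oferta on the inner div are assumptions made for illustration; only product_price and normal_encontrado come from the post.

```r
library(XML)

# Hypothetical input: the structure and the "precio_oferta" class are assumed.
# The second product_price div deliberately has no p subnode.
Lines <- '
<root>
  <div class="product_price">
    <p class="normal_encontrado">S/. 2,799.00</p>
    <div class="precio_oferta">S/. 2,299.00</div>
  </div>
  <div class="product_price">
    <div class="precio_oferta">S/. 4,999.00</div>
  </div>
</root>
'

doc <- xmlTreeParse(Lines, asText = TRUE, useInternalNodes = TRUE)

# one column per product_price node, with rows p and div
m <- xpathSApply(doc, "//div[@class = 'product_price']", function(node) {
  list(p = xmlValue(node[["p"]]), div = xmlValue(node[["div"]]))
})

DF <- as.data.frame(t(m), stringsAsFactors = FALSE)  # rework into data frame
DF[] <- lapply(DF, function(x) as.numeric(gsub("[^.0-9]|[.]\\D", "", x)))  # clean
DF
```

Under these assumptions, the row for the second node should show NA in the p column, matching the result above.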