Scraping with rvest - complete with NAs when tag is not present

清酒与你 2020-11-30 12:20

I want to parse this HTML and get these elements from it:

a) p tag with class "normal_encontrado".
b) div with c

4 Answers
  •  渐次进展
    2020-11-30 12:26

    Using the XML package, parse the input with xmlTreeParse and then use xpathSApply to iterate over the div nodes of class product_price. For each such node, the anonymous function gets the values of the div and p subnodes. The resulting character matrix m is reworked into a data frame DF, and the columns are cleaned by removing any character that is not a dot or digit, and also any dot followed by a non-digit. The result is then converted to numeric. Note that no special processing is needed for the missing p case: it simply comes out as NA.

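    # input
    # (the HTML markup was stripped from this page; the tags below are a minimal
    # reconstruction inferred from the XPath, the question and the result)

    Lines <- '
    <html>
    <body>
    <div class="product_price">
    <p class="normal_encontrado">S/. 2,799.00</p>
    <div>S/. 2,299.00</div>
    </div>
    <div class="product_price">
    <div>S/. 4,999.00</div>
    </div>
    </body>
    </html>
    '

    # code to read input and produce a data.frame
    library(XML)

    doc <- xmlTreeParse(Lines, asText = TRUE, useInternalNodes = TRUE)

    # for each product_price div, extract the text of its p and div children
    m <- xpathSApply(doc, "//div[@class = 'product_price']", function(node) {
      list(p = xmlValue(node[["p"]]), div = xmlValue(node[["div"]]))
    })

    DF <- as.data.frame(t(m), stringsAsFactors = FALSE)  # rework into data frame
    DF[] <- lapply(DF, function(x) as.numeric(gsub("[^.0-9]|[.]\\D", "", x)))  # clean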

    The result is:

    > DF
         p  div
    1 2799 2299
    2   NA 4999
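
Since the question asks about rvest, the same idea can be sketched there as well. This is only a sketch under assumptions: the markup is a hypothetical stand-in (the original HTML is not preserved on this page), the names html, products and to_num are made up for illustration, and it assumes rvest 1.0 or later, where html_elements() and html_element() are available. The point it relies on is that html_element() (singular) returns one result per input node, so a product without a p tag produces NA instead of being dropped.

```r
library(rvest)

# Hypothetical markup: one product has the p tag, the other does not
html <- '
<div class="product_price">
  <p class="normal_encontrado">S/. 2,799.00</p>
  <div>S/. 2,299.00</div>
</div>
<div class="product_price">
  <div>S/. 4,999.00</div>
</div>'

doc <- read_html(html)

# One node per product
products <- html_elements(doc, "div.product_price")

# html_element() keeps one entry per product; where nothing matches,
# html_text() returns NA for that entry
p_txt   <- html_text(html_element(products, "p.normal_encontrado"))
div_txt <- html_text(html_element(products, "div"))

# Same cleaning rule as in the answer above: keep digits and decimal dots
to_num <- function(x) as.numeric(gsub("[^.0-9]|[.]\\D", "", x))

DF <- data.frame(p = to_num(p_txt), div = to_num(div_txt))
DF
```

Selecting the products first and then looking inside each one is what keeps the two columns aligned; calling html_elements(doc, "p.normal_encontrado") on the whole document would return only the tags that exist, with no gap for the missing one.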
    
