Cleaning HTML code in R: how to clean this list?

Question

I know this question has been asked here many times, but after reading a bunch of topics I'm still stuck on it. I have a list of scraped HTML nodes like this:

<a href="http://bit.d o/bnRinN9" target="_blank" style="color: #ff7700; font-weight: bold;">http://bit.d o/bnRinN9</a>

and I just want to strip out all the markup. Unfortunately I'm a newbie, and the only thing that comes to mind is the Cthulhu way (regex, argh!). How can I do this?

*I put a space between "d" and "o" in the domain name because SO doesn't allow posting that link.

Answer 1:

This uses the data linked in "Why R can't scrape these links?", which was downloaded and saved locally.

library(rvest)
library(stringr)

# read the saved HTML page and collapse it into one string
lines <- readLines("~/Downloads/testlink.html")
text <- paste0(lines, collapse = "\n")

# the links are within a table, within spans. There isn't much structure
# and there are no identifiers, so it needs a little hacking to get the right elements.
# There are probably smarter CSS selectors that could avoid the hacks.
# html_nodes() replaces the deprecated xml_nodes()
spans <- read_html(text) %>% html_nodes(css = "table tbody tr td span")

# extract all the short links, but remove the links to the edit pages
# note these links have a trailing dash: they point to the statistics,
# not the content
short_links <- spans %>% html_nodes("a") %>% html_attr("href")
short_links <- short_links[!str_detect(short_links, "/edit")]

# the real URLs are in the span text, prefixed with http
span_text  <- spans %>% html_text() %>% unlist()
long_links <- span_text[str_detect(span_text, "http")]

# > short_links
# [1] "http://bit.dxo/scrprtest7-" "http://bit.dxo/scrprtest6-" "http://bit.dxo/scrprtest5-" "http://bit.dxo/scrprtest4-" "http://bit.dxo/scrprtest3-"
# [6] "http://bit.dxo/scrprtest2-" "http://bit.dox/scrprtest1-"
# > long_links
# [1] "http://edition.cnn.com/2017/07/21/opinions/trump-russia-putin-lain-opinion/index.html"
# [2] "http://edition.cnn.com/2017/07/21/opinions/trump-russia-putin-lain-opinion/index.html"
# [3] "http://edition.cnn.com/2017/07/21/opinions/trump-russia-putin-lain-opinion/index.html"
# [4] "http://edition.cnn.com/2017/07/21/opinions/trump-russia-putin-lain-opinion/index.html"
# [5] "http://edition.cnn.com/2017/07/21/opinions/trump-russia-putin-lain-opinion/index.html"
# [6] "http://edition.cnn.com/2017/07/21/opinions/trump-russia-putin-lain-opinion/index.html"
# [7] "http://edition.cnn.com/2017/07/21/opinions/trump-russia-putin-lain-opinion/index.html"
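The comments above suggest that a smarter CSS selector might avoid the post-filtering step. A minimal sketch of that idea follows, assuming the edit links can be recognised by "/edit" in their href. The HTML snippet and the example.com URLs are made up for illustration, since the real page's markup is not shown here.

```r
library(rvest)

# Hypothetical stand-in for one table row of the saved page
html <- '<table><tr><td><span>
  <a href="http://example.com/abc-">stats</a>
  <a href="http://example.com/abc/edit">edit</a>
</span></td></tr></table>'

page <- read_html(html)

# :not([href*="/edit"]) drops anchors whose href contains "/edit",
# so no str_detect() filtering is needed afterwards
short_links <- page %>%
  html_nodes('td span a:not([href*="/edit"])') %>%
  html_attr("href")

short_links
```

Whether this is actually cleaner than filtering with str_detect depends on how regular the real markup is.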

Answer 2:

The rvest package includes many simple functions for scraping and processing HTML. It depends on the xml2 package. Generally you can scrape and filter in one step.

It's not clear whether you want to extract the href value or the HTML text, which are the same in your example. This code extracts the href value by finding the a nodes and then reading their href attribute. Alternatively, you can use html_text to get the link's display text.

library(rvest)
links <- list('
<a href="http://anydomain.com/bnRinN9" target="_blank" style="color: #ff7700; font-weight: bold;">http://anydomain.com/bnRinN9</a>
<a href="domain.com/page">
')

# make one string
text <- paste0(links, collapse = "\n")
# html_nodes()/html_attr() replace the deprecated xml_nodes() and xml2's xml_attr()
hrefs <- read_html(text) %>% html_nodes("a") %>% html_attr("href")
hrefs


## [1] "http://anydomain.com/bnRinN9" "domain.com/page"
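For the html_text alternative mentioned above, here is a minimal sketch that handles a vector of separate HTML snippets, which is closer to the "list of nodes" in the question. The snippets vector and the second URL are illustrative assumptions, not data from the question.

```r
library(rvest)

# Hypothetical input: one HTML fragment per element
snippets <- c(
  '<a href="http://anydomain.com/bnRinN9" target="_blank">http://anydomain.com/bnRinN9</a>',
  '<a href="http://anydomain.com/other" style="color: #ff7700;">Other page</a>'
)

# Parse each fragment and take the first <a> element found in it
first_anchor <- function(s) read_html(s) %>% html_node("a")

# Visible link text, with all markup stripped
link_text <- vapply(snippets, function(s) html_text(first_anchor(s)),
                    character(1), USE.NAMES = FALSE)

# The href attribute of the same elements
link_href <- vapply(snippets, function(s) html_attr(first_anchor(s), "href"),
                    character(1), USE.NAMES = FALSE)

link_text
link_href
```

If the list already holds parsed nodes rather than strings, html_text and html_attr can be applied to them directly, without calling read_html again.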

Source: https://stackoverflow.com/questions/45564669/cleaning-html-code-in-r-how-to-clean-this-list

Tags

regex

gsub