Question
My problem
I am trying to parse an HTML file (downloaded via the Google Drive API as text/html) into a list in R.
The HTML looks like this (sorry for the German content):
<p style='padding:0;margin:0;color:#000000;font-size:11pt;font-family:"Arial";line-height:1.15;orphans:2;widows:2;text-align:left'>
<span>text: Das </span>
<span style="color:#1155cc;text-decoration:underline"><a href="https://www.google.com/url?q=http://www.bundesverfassungsgericht.de/SharedDocs/Entscheidungen/DE/2011/10/rs20111012_2bvr023608.html&sa=D&ust=1503574789125000&usg=AFQjCNE4Ij3mvMX-QttYQYqspAaMxaZaeg" style="color:inherit;text-decoration:inherit">Verfassungsgericht urteilt</a></span>
<span style='color:#000000;font-weight:400;text-decoration:none;vertical-align:baseline;font-size:11pt;font-family:"Arial";font-style:normal'>,
dass eindeutig private Kommunikation von der Überwachung ausgenommen sein muss</span></p>
It works well when I only extract the text with xmlValue (from the XML package), using something like:
library(XML)
doc <- htmlParse(html, asText = TRUE)
text <- xpathSApply(doc, "//text()", xmlValue)
But in my case I need to retain the links (<a> tags) in the HTML file and strip the https://www.google.com/url?q= prefix from them. So I want to get rid of all styling and keep only the text plus the link tags.
What I tried so far
I tried to select both kinds of nodes by using //(p | a) in the XPath, but it didn't work.
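A likely reason is that XPath 1.0, which libxml2 (and therefore the XML package) implements, does not accept a parenthesised union as a step inside a path. The sketch below shows two spellings that are valid XPath 1.0. The short HTML string and the example.org URL are made up for illustration and are not the real document.

```r
library(XML)

# Hypothetical, shortened stand-in for the real document
html <- paste0(
  '<p style="margin:0"><span>text: Das </span>',
  '<span><a href="https://www.google.com/url?q=http://example.org/page.html&amp;sa=D">',
  'Gericht urteilt</a></span></p>'
)
doc <- htmlParse(html, asText = TRUE)

# Union of two complete location paths
nodes <- getNodeSet(doc, "//p | //a")
node_names <- sapply(nodes, xmlName)

# Equivalent single path that filters on the self axis
nodes2 <- getNodeSet(doc, "//*[self::p or self::a]")
```

Both queries select the <p> element and the <a> element; neither removes the styling attributes, so this only addresses the selection step.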
Answer 1:
I prefer to use the rvest package instead of XML.
In this code I use the rvest package to parse the HTML and extract the links from the page. Then, using the stringr package, I split each link at the ?q= part and keep the second half, which is the original target URL.
library(rvest)
library(stringr)

# Read the HTML file
page <- read_html("sample.txt")

# Find the link nodes, then extract the href attribute (i.e. the links)
link <- page %>% html_nodes("a") %>% html_attr("href")

# Keep the part after "?q=" for the first link
# (use sapply over the list if the document has more than one link).
# Note: the result still carries Google's trailing &sa=...&ust=...&usg=... parameters.
desiredlink <- str_split(link, "\\?q=")[[1]][2]

# Find the text in all of the span nodes
span_text <- page %>% html_nodes("span") %>% html_text()

# or this for the text under the p nodes
p_text <- page %>% html_nodes("p") %>% html_text()
I saved your sample HTML from above to the file "sample.txt".
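To go one step further and rebuild the paragraph as plain text with only the <a> tags kept, something along the following lines might work. This is a sketch under some assumptions, not a tested general solution: it uses xml2 (the package rvest is built on) directly, the helper names clean_href and render are made up for this example, the HTML string is a shortened copy of the sample, and the text is not re-escaped, so content containing < or & would need extra care.

```r
library(xml2)

# Shortened copy of the sample paragraph (inline string instead of sample.txt)
html <- paste0(
  '<p style="padding:0;margin:0"><span>text: Das </span>',
  '<span style="color:#1155cc"><a href="https://www.google.com/url?q=',
  'http://www.bundesverfassungsgericht.de/SharedDocs/Entscheidungen/DE/2011/10/',
  'rs20111012_2bvr023608.html&amp;sa=D&amp;ust=1503574789125000" ',
  'style="color:inherit">Verfassungsgericht urteilt</a></span>',
  '<span style="font-weight:400">, dass eindeutig private Kommunikation ',
  'ausgenommen sein muss</span></p>'
)
page <- read_html(html)

# Hypothetical helper: unwrap a Google redirect URL.
# Keeps what follows "?q=" up to the next "&"; other URLs pass through unchanged.
clean_href <- function(href) {
  target <- sub("^https?://www\\.google\\.com/url\\?q=([^&]*).*$", "\\1", href)
  utils::URLdecode(target)
}

# Hypothetical helper: turn one node into output text.
# Links become a bare <a href="..."> tag, text nodes are copied as they are.
render <- function(node) {
  if (xml_name(node) != "a") {
    return(xml_text(node))
  }
  sprintf('<a href="%s">%s</a>', clean_href(xml_attr(node, "href")), xml_text(node))
}

# Text nodes outside links, plus the links themselves, in document order
p <- xml_find_first(page, "//p")
parts <- xml_find_all(p, ".//text()[not(ancestor::a)] | .//a")
result <- paste(vapply(parts, render, character(1)), collapse = "")
```

Building the output string by hand, rather than editing the parsed document and serialising it again, means none of the style attributes or span wrappers can leak into the result. For a document with several paragraphs, the last three lines could be applied to each <p> node in turn to get one string per paragraph.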
Source: https://stackoverflow.com/questions/45862905/parsing-html-to-text-with-link-tags-remaining-in-r