R Read & Parse HTML to List

China☆狼群 submitted on 2019-12-11 05:53:25

Question


I have been trying to read and parse a bit of HTML to obtain a list of conditions for animals at an animal shelter. I'm sure my inexperience with HTML parsing isn't helping, but I seem to be getting nowhere fast.

Here's a snippet of the HTML:

<select multiple="true" name="asilomarCondition" id="asilomarCondition">

    <option value="101">
        Behavior- Aggression, Confrontational-Toward People (mild)
        -
        TM</option>
....
</select>

There is only one <select ...> tag; everything inside it is an <option value=x> tag.

I've been using the XML library. I can remove the newlines and tabs, but haven't had any success removing the tags:

conditions.html <- paste(readLines("Data/evalconditions.txt"), collapse="\n")
conditions.text <- gsub('[\t\n]',"",conditions.html)

As a final result, I'd like a list of all of the conditions that I can process further for later use as factor names:

Behavior- Aggression, Confrontational-Toward People (mild)-TM
Behavior- Aggression, Confrontational-Toward People (moderate/severe)-UU
...

I'm not sure if I need to use the XML library (or another library) or if gsub patterns would be sufficient (either way, I need to work out how to use it).


Answer 1:


Here is a start using the rvest package:

library(rvest)
# read the HTML page
page <- read_html("test.html")
# get the text from the "option" nodes and then trim the surrounding whitespace
nodes <- trimws(html_text(html_nodes(page, "option")))

# each option's text still contains newlines and indentation between its parts;
# remove every newline together with the spaces or tabs around it
nodes <- gsub("\\s*\n\\s*", "", nodes)

The vector nodes should be the result you requested. This example is based on the limited sample provided above, so the actual page may produce unexpected results.
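
To try the approach without a file on disk, here is a minimal self-contained sketch. It assumes the markup looks like the snippet in the question: the inline HTML string is a hypothetical sample, the second option's value ("102") is made up for illustration, and the variable names are placeholders. It passes the markup to read_html as a literal string, extracts the option text, and cleans it with a single pattern that drops each newline together with the whitespace around it.

```r
library(rvest)

# Hypothetical inline sample mirroring the snippet in the question;
# the value "102" on the second option is assumed for illustration.
html <- '<select multiple="true" name="asilomarCondition" id="asilomarCondition">

    <option value="101">
        Behavior- Aggression, Confrontational-Toward People (mild)
        -
        TM</option>
    <option value="102">
        Behavior- Aggression, Confrontational-Toward People (moderate/severe)
        -
        UU</option>
</select>'

# read_html() accepts literal markup as well as a file path or URL
page <- read_html(html)

# Extract the text of every <option> node
raw <- html_text(html_nodes(page, "option"))

# Trim both ends, then drop each newline along with the indentation around it
conditions <- gsub("\\s*\n\\s*", "", trimws(raw))

# Optional: keep the conditions as a factor, preserving their page order
condition_factor <- factor(conditions, levels = conditions)

print(conditions)
```

Anchoring the pattern on the newline leaves the single spaces inside each condition untouched, which a blanket removal of repeated spaces would not guarantee if the real page is indented differently.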



Source: https://stackoverflow.com/questions/38907455/r-read-parse-html-to-list
