rvest package read_html() function stops reading at “<” symbol

Posted by 北城余情 on 2020-01-05 08:57:52

Question


I was wondering if this behavior is intentional in the rvest package. When rvest sees the < character it stops reading the HTML.

library(rvest)
read_html("<html><title>under 30 years = < 30 years </title></html>")

Prints:

[1] <head>\n  <title>under 30 years = </title>\n</head>

If this is intentional, is there a workaround?


Answer 1:


Yes, this is expected behavior for rvest, because it is how HTML itself works.

See the w3schools HTML Entities page. < and > are reserved characters in HTML, so to show them literally you have to write them as character entities instead. Here is the entity table from that page, listing some commonly used characters and their corresponding HTML entities.

XML::readHTMLTable("http://www.w3schools.com/html/html_entities.asp", which = 2)
#    Result          Description Entity Name Entity Number
# 1           non-breaking space      &nbsp;        &#160;
# 2       <            less than        &lt;         &#60;
# 3       >         greater than        &gt;         &#62;
# 4       &            ampersand       &amp;         &#38;
# 5       ¢                 cent      &cent;        &#162;
# 6       £                pound     &pound;        &#163;
# 7       ¥                  yen       &yen;        &#165;
# 8       €                 euro      &euro;       &#8364;
# 9       ©            copyright      &copy;        &#169;
# 10      ® registered trademark       &reg;        &#174;

So you will have to replace those values, perhaps with gsub() or manually if there aren't too many. You can see that it will parse properly when those characters are replaced with the correct entity.

library(XML)
doc <- htmlParse("<html><title>under 30 years = &lt; 30 years </title></html>")
xmlValue(doc["//title"][[1]])
# [1] "under 30 years = < 30 years "

You could use gsub() for this. Note that the pattern below only matches a < with a space on each side:

txt <- "<html><title>under 30 years = < 30 years </title></html>"
xmlValue(htmlParse(gsub(" < ", " &lt; ", txt, fixed = TRUE))["//title"][[1]])
# [1] "under 30 years = < 30 years "
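
The fixed pattern above only catches a < that has a space on both sides. A somewhat more general sketch, in base R only, is to escape every < that does not look like the start of a tag. The helper name escape_stray_lt is hypothetical (it is not part of rvest or XML), and it assumes a stray < is never immediately followed by a letter, "/", "!" or "?"; text such as "a<b" would still be mistaken for a tag.

```r
# Hypothetical helper, not part of rvest or XML: escape any "<" that does not
# look like the start of a tag, closing tag, comment, doctype, or processing
# instruction.
escape_stray_lt <- function(html) {
  # The negative lookahead (?!...) requires perl = TRUE.
  # A "<" followed by a letter, "/", "!" or "?" is left untouched.
  gsub("<(?![A-Za-z/!?])", "&lt;", html, perl = TRUE)
}

txt <- "<html><title>under 30 years = < 30 years </title></html>"
escape_stray_lt(txt)
# [1] "<html><title>under 30 years = &lt; 30 years </title></html>"
```

The escaped string can then be passed to htmlParse() or read_html() as usual.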

I used the XML package here, but the same applies to other packages that process HTML.
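
Since the question was about rvest, here is a minimal sketch of the same workaround using xml2, the package whose read_html() rvest re-exports. It assumes xml2 is installed and that the only problematic character is the " < " in the title.

```r
library(xml2)  # rvest re-exports read_html() from xml2

txt <- "<html><title>under 30 years = < 30 years </title></html>"

# Replace the literal " < " with its entity before parsing.
fixed <- gsub(" < ", " &lt; ", txt, fixed = TRUE)

doc <- read_html(fixed)

# Extract the title text; the entity is decoded back to "<".
xml_text(xml_find_first(doc, "//title"))
# [1] "under 30 years = < 30 years "
```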



Source: https://stackoverflow.com/questions/33447676/rvest-package-read-html-function-stops-reading-at-symbol
