Question
The method html_text() (from the R package rvest) concatenates the text of a node and all of its children. I would like to extract only the parent node's own text.
For the following example, html_text() gives HELLO GOODBYE.
I want to get just GOODBYE. How can I get it?
<div class="joke">
<div class="div_inside">
<div class="title_inside">
<a class="link" href="sompage.htm">HELLO</a>
</div>
</div>
GOODBYE
</div>
Answer 1:
Try grabbing the main div tag with class "joke" and selecting only its child nodes that are not div elements, using XPath:
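library(rvest)
# Replace 'your_html_script' with the HTML string, file path, or URL to parse
read_html('your_html_script') %>%
  # keep the child nodes of div.joke that are not themselves div elements
  html_nodes(xpath = '//div[@class="joke"]/node()[not(self::div)]') %>%
  html_text()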
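As a variation on the snippet above, here is a minimal sketch that runs on the HTML from the question. It assumes the markup is available as a string (the variable names html and own_text are just illustrative), and it swaps in the XPath text() node test so that only text nodes directly under the div are matched. The trimming and filtering steps are an addition of mine, not part of the original answer; they are there because the surrounding line breaks are likely to come back as part of the text.

```r
library(rvest)

# The HTML from the question, inlined so the example is self-contained.
html <- '<div class="joke">
  <div class="div_inside">
    <div class="title_inside">
      <a class="link" href="sompage.htm">HELLO</a>
    </div>
  </div>
  GOODBYE
</div>'

own_text <- read_html(html) %>%
  # text() matches only the text nodes that are direct children of the div
  html_nodes(xpath = '//div[@class="joke"]/text()') %>%
  html_text() %>%
  # strip the leading and trailing whitespace around each piece of text
  trimws()

# Drop any entries that consisted of whitespace only
own_text <- own_text[nzchar(own_text)]
own_text
```

With the example markup this should leave just GOODBYE. If the parent contained several separate runs of text, own_text would hold one element per run, so you may want paste(own_text, collapse = " ") to join them.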
Thanks!
Source: https://stackoverflow.com/questions/39506292/rvest-getting-node-text-and-not-its-childens-text