Rvest: getting node text and not its children's text

Posted by 风流意气都作罢 on 2019-12-12 19:13:28

Question


The html_text() function (from the R package rvest) concatenates the text of a node and all its children. I would like to extract only the parent node's own text.

For the following example, html_text() gives HELLO GOODBYE.

I want to get just GOODBYE. How can I get it?

<div class="joke">
  <div class="div_inside">
    <div class="title_inside">
      <a class="link" href="sompage.htm">HELLO</a>
    </div>
  </div>
  GOODBYE
</div>

Answer 1:


Select the div with class "joke" and use XPath to take only its direct child nodes that are not div elements, so the text nested inside the inner divs is skipped:

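library(rvest)

# 'your_html_script' is a placeholder for your HTML string, file path, or URL
read_html('your_html_script') %>%
    # direct children of div.joke that are not <div> elements (here, its text nodes)
    html_nodes(xpath = '//div[@class="joke"]/node()[not(self::div)]') %>%
    html_text()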
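
The result of the call above can still carry the newlines and indentation around GOODBYE. As a minimal sketch of one way to tidy that up, the variant below selects the node's direct text() children and trims them. The names html_snippet and own_text are illustrative only, and the snippet is the HTML from the question pasted in as a string, which assumes read_html() is given literal markup rather than a path or URL.

```r
library(rvest)

# The example HTML from the question, inlined as a string (assumption:
# in real use this would be a URL, a file path, or a downloaded page).
html_snippet <- '<div class="joke">
  <div class="div_inside">
    <div class="title_inside">
      <a class="link" href="sompage.htm">HELLO</a>
    </div>
  </div>
  GOODBYE
</div>'

own_text <- read_html(html_snippet) %>%
  # text() matches only text nodes that are direct children of div.joke
  html_nodes(xpath = '//div[@class="joke"]/text()') %>%
  html_text()

# Strip surrounding whitespace, then drop pieces that were whitespace only
own_text <- trimws(own_text)
own_text <- own_text[nzchar(own_text)]

print(own_text)
```

Using text() instead of node()[not(self::div)] also skips non-div child elements such as a span, which may or may not be what you want; if the parent has several separate text fragments, own_text will contain one element per fragment.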

Thanks!



Source: https://stackoverflow.com/questions/39506292/rvest-getting-node-text-and-not-its-childens-text
