Parsing HTML containing &nbsp; (non-breaking space)

轻奢々 2020-12-16 23:19

I am using rvest to parse a website. I'm hitting a wall with these little non-breaking spaces. How does one remove the whitespace that is created by the &nbsp; entity?

6 answers
  •  猫巷女王i
    2020-12-17 00:11

    The &nbsp; entity stands for "non-breaking space", which in Unicode is a character distinct from a "regular" space (i.e. " "). Compare:

    bodytext <- "\u00a0foo"  # stand-in for the scraped text (assumed here)
    charToRaw(" foo")
    # [1] 20 66 6f 6f
    charToRaw(bodytext)
    # [1] c2 a0 66 6f 6f
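
To confirm that a stray space really is a non-breaking one, it can help to look at code points rather than raw bytes. A minimal sketch, assuming a made-up string `x` in place of the scraped text:

```r
# Hypothetical stand-in for scraped text; "\u00a0" is the non-breaking space.
x <- "\u00a0foo"

# Unicode code points of each character; 160 (0xA0) is the non-breaking space.
code_points <- utf8ToInt(x)

# TRUE if the string contains at least one non-breaking space.
has_nbsp <- 160L %in% code_points
```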
    

    So you'd want to use one of the special character classes for whitespace. You can remove all whitespace characters with:

    gsub("\\s", "", bodytext)
    

    On Windows, I needed to make sure the encoding of the string was set properly first:

    Encoding(bodytext) <- "UTF-8"
    gsub("\\s", "", bodytext)
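
Whether `\\s` matches a non-breaking space can depend on the platform and regex engine, so if the call above leaves the character in place, one alternative is to target U+00A0 directly. This is only a sketch: `bodytext` here is a made-up stand-in for the scraped text, and `no_nbsp` and `normalized` are illustrative names.

```r
# Hypothetical stand-in for text scraped with rvest;
# "\u00a0" is the non-breaking space.
bodytext <- "\u00a0foo\u00a0bar "

# Option 1: delete only the non-breaking spaces, keeping ordinary spaces.
no_nbsp <- gsub("\u00a0", "", bodytext, fixed = TRUE)

# Option 2: turn non-breaking spaces into regular spaces,
# then trim leading and trailing whitespace.
normalized <- trimws(gsub("\u00a0", " ", bodytext, fixed = TRUE))
```

Matching with `fixed = TRUE` avoids relying on how a character class treats U+00A0, at the cost of handling only that one character.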
    
