parsing html containing &nbsp; (non-breaking space)

轻奢々 2020-12-16 23:19

I am using rvest to parse a website. I'm hitting a wall with these little non-breaking spaces. How does one remove the whitespace that is created by the &nbsp; character?

6 answers
  •  盖世英雄少女心
    2020-12-17 00:10

    Posting this since I think it's the most robust approach.

    I scraped a Wikipedia page and got this in my output (not sure if it'll copy-paste properly):

    x <- " California"
    

    And gsub("\\s", "", x) didn't change anything, which raised the flag that something fishy is going on.

    To investigate, I did:

    dput(charToRaw(strsplit(x, "")[[1]][1]))
    # as.raw(c(0xc2, 0xa0))
    

    This shows how exactly that character is stored/recognized in memory.
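The byte inspection above can be cross-checked against the character's code point. A minimal sketch, assuming the stray character is the non-breaking space U+00A0; here the string is built with an explicit `\u00a0` escape (my choice, not part of the original answer) so the example does not depend on the character surviving copy-paste:

```r
# Assumption: the leading character is U+00A0, written as an escape
# so that it cannot be lost or altered when the code is copied.
x <- "\u00a0California"

# Take the first character and make sure it is stored as UTF-8.
first_char <- enc2utf8(substr(x, 1, 1))

# Unicode code point of that character (decimal).
code_point <- utf8ToInt(first_char)

# The bytes used to store it in UTF-8.
bytes <- charToRaw(first_char)

print(code_point)
print(bytes)
```

If the code point comes out as 160 (hex A0) and the bytes as c2 a0, the character is a non-breaking space rather than an ordinary space.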

    With this in hand, we can use gsub a bit more robustly than in the other solutions:

    gsub(rawToChar(as.raw(c(0xc2, 0xa0))), "", x)
    # [1] "California"
    

    (@MrFlick's suggestion to set the encoding didn't work for me, and it's not clear where @shabbychef got the input 160 for intToUtf8; this approach can be generalized to other similar situations)
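For what it's worth, the 160 passed to intToUtf8 is presumably just the decimal form of 0xA0, the code point of the non-breaking space, so both routes arrive at the same character. A rough sketch of the removal step under that assumption; the variable names (`nbsp`, `cleaned`, `normalized`) and the "New York" input are illustrative only, and `fixed = TRUE` is my addition to keep the pattern from being interpreted as a regex:

```r
# Assumption: the stray character is U+00A0 (UTF-8 bytes c2 a0).
nbsp <- "\u00a0"

# intToUtf8(160) produces the same bytes, since 0xA0 is 160 in decimal.
same_char <- identical(charToRaw(intToUtf8(160)), charToRaw(enc2utf8(nbsp)))

x <- "\u00a0California"

# fixed = TRUE matches the pattern as a literal string, not a regex.
cleaned <- gsub(nbsp, "", x, fixed = TRUE)

# Alternative: turn non-breaking spaces into ordinary spaces, then trim the ends.
normalized <- trimws(gsub(nbsp, " ", "New\u00a0York\u00a0", fixed = TRUE))

print(cleaned)
print(normalized)
```

The second form is probably the safer default when the non-breaking space sits between words, since deleting it outright would glue the words together.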
