gsub in R with Unicode replacement gives different results under Windows compared with Unix?

悲&欢浪女 2020-12-04 00:34

Running the following commands in R under Mac or Linux produces the expected result, that is, the Greek letter beta:

gsub(\"\", \"\\u         


        
2 Answers
  • 2020-12-04 01:03

    Just to elaborate on @MrFlick's solution: you have to set the encoding again each time the string has been processed by gsub, as in:

    s <- "blah<U+03B2>blah-blah<U+03B2>blah-blah<U+03B2>blah"
    # setting the encoding here and not in the while loop will not fix the problem
    {
    while(grepl('<U\\+[0-9A-Fa-f]{4}>',s)){
        newVal <- gsub('^.*<U\\+([0-9A-Fa-f]{4})>.*$','"\\\\u\\1"',s)
        newVal <- eval(parse(text=newVal))
        cat(newVal,'\n')
        s <- gsub('^(.*)<U\\+[0-9A-Fa-f]{4}>(.*)$',
                  paste0('\\1',newVal,'\\2'),
                  s)
        # setting the encoding here fixes the cross platform differences
        Encoding(s) <- 'UTF-8'
    }
    cat(s,'\n')
    # setting the encoding here and not in the while loop will raise an error
    }
    Encoding(s)
    
  • 2020-12-04 01:16

    If you're not seeing the right character on Windows, try explicitly setting the encoding:

    x <- gsub("<U\\+[0-9A-F]{4}>", "\u03B2", "<U+03B2>")
    Encoding(x) <- "UTF-8"
    x
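
    To check what actually ended up in the string, independent of how the console renders it, one option is to look at the declared encoding and the code points. This is only a minimal sketch using base R (`Encoding`, `utf8ToInt`, `sprintf`); the variable name `cp` is illustrative:

    ```r
    # Same replacement as above, with the result explicitly marked as UTF-8
    x <- gsub("<U\\+[0-9A-F]{4}>", "\u03B2", "<U+03B2>")
    Encoding(x) <- "UTF-8"

    # Inspect the string without relying on console rendering
    Encoding(x)              # declared encoding of the string
    cp <- utf8ToInt(x)       # integer code point(s) of the string
    sprintf("U+%04X", cp)    # the same code points in U+XXXX notation
    ```

    If the code point reported here is the expected one but the console still shows something else, the problem is more likely in display than in the string itself.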
    

    As for replacing all such symbols with Unicode characters, I've adapted another answer to do a similar thing. Here we build the UTF-8 bytes of each Unicode character as a raw vector. Here's a helper function:

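    trueunicode <- function(x) {
        # encode a single code point as UTF-8 by packing its bits into 1-3 bytes
        packuni <- Vectorize(function(cp) {
            bv <- intToBits(cp)
            # the position of the highest set bit decides how many bytes are needed
            maxbit <- tail(which(bv != as.raw(0)), 1)
            if (maxbit < 8) {
                # 1 byte (ASCII): 0xxxxxxx
                rawToChar(as.raw(cp))
            } else if (maxbit < 12) {
                # 2 bytes: 110xxxxx 10xxxxxx
                rawToChar(rev(packBits(c(bv[1:6], as.raw(c(0,1)), bv[7:11], as.raw(c(0,1,1))), "raw")))
            } else if (maxbit < 17) {
                # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
                rawToChar(rev(packBits(c(bv[1:6], as.raw(c(0,1)), bv[7:12], as.raw(c(0,1)), bv[13:16], as.raw(c(0,1,1,1))), "raw")))
            } else {
                stop("too many bits")
            }
        })
        # locate every <U+XXXX> token in each element of x
        m <- gregexpr("<U\\+[0-9a-fA-F]{4}>", x)
        codes <- regmatches(x, m)
        chars <- lapply(codes, function(x) {
            # elements without any token have nothing to convert
            if (length(x) == 0) return(character(0))
            # characters 4-7 of "<U+XXXX>" are the four hex digits
            codepoints <- strtoi(paste0("0x", substring(x, 4, 7)))
            packuni(codepoints)
        })
        # substitute the tokens and mark the result as UTF-8
        regmatches(x, m) <- chars
        Encoding(x) <- "UTF-8"
        x
    }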
    

    and then we can use it like

    x <- c("beta <U+03B2>", "flipped e <U+018F>!", "<U+2660> <U+2663> <U+2665> <U+2666>")
    trueunicode(x)
    # [1] "beta β"       "flipped e Ə!" "♠ ♣ ♥ ♦"
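
    For comparison, base R's `intToUtf8()` can build the character directly from the code point, which avoids packing the bits by hand. The following is a sketch of that alternative, not the approach shown above; the name `trueunicode2` is made up for illustration, and it assumes the same four-hex-digit `<U+XXXX>` token format:

    ```r
    # Hypothetical variant of the helper above, built on intToUtf8()
    trueunicode2 <- function(x) {
        # locate every <U+XXXX> token in each element of x
        m <- gregexpr("<U\\+[0-9a-fA-F]{4}>", x)
        regmatches(x, m) <- lapply(regmatches(x, m), function(codes) {
            # characters 4-7 of "<U+XXXX>" are the four hex digits
            cps <- strtoi(substring(codes, 4, 7), base = 16L)
            # multiple = TRUE returns one string per code point
            intToUtf8(cps, multiple = TRUE)
        })
        # mark the result as UTF-8, as in the original helper
        Encoding(x) <- "UTF-8"
        x
    }

    trueunicode2(c("beta <U+03B2>", "<U+2660> <U+2663>"))
    ```

    The explicit `Encoding(x) <- "UTF-8"` is kept at the end so that the result is marked the same way as in the original helper.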
    