How to remove strange characters using gsub in R?

不思量自难忘° 2020-12-06 07:50

I'm trying to clean up some text that was loaded into memory using readLines(..., encoding='UTF-8').

If I don't specify the encoding, I see all kind

2 Answers
  • 2020-12-06 08:28

    If you want to use regular expressions, you can keep only those characters you want using a range of ASCII codes:

    text = "The way I talk to my family......i would get my ass beat to 
    DEATH....but they kno I cray cray & just leave it at that 😜ðŸ˜â˜º'"
    
    gsub('[^\x20-\x7E]', '', text)
    
    # [1] "The way I talk to my family......i would get my ass beat to DEATH....but they kno I cray cray & just leave it at that '"
    

    The table of ASCII codes at asciitable.com shows that the pattern removes every character outside the range x20 (SPACE) to x7E (~), which is the printable ASCII range.
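
One consequence of this range is that tabs and newlines (codes below x20) are removed as well. The sketch below wraps the same pattern in a helper and shows a variant that keeps common whitespace. The function name `strip_non_ascii` and the sample vector are hypothetical, chosen only for illustration.

```r
# Hypothetical helper: keep only printable ASCII (0x20 to 0x7E),
# using the same pattern as the answer above.
strip_non_ascii <- function(x) {
  gsub("[^\x20-\x7E]", "", x)
}

# Made-up sample input: an accented letter, plain text, and a tab.
x <- c("caf\u00e9 au lait", "plain text", "tab\there")

cleaned <- strip_non_ascii(x)
print(cleaned)
# [1] "caf au lait" "plain text"  "tabhere"

# Variant (an assumption about what you may want): also keep
# tab, newline, and carriage return.
kept_ws <- gsub("[^\t\n\r\x20-\x7E]", "", x)
print(kept_ws)
```

If the line structure of the text matters, the second pattern is probably the safer choice, since the first one silently joins words that were separated only by a tab or newline.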

  • 2020-12-06 08:38

    The easiest way to get rid of these characters is to convert from UTF-8 to ASCII; with sub='', anything that cannot be converted is dropped:

    combined_doc <- iconv(combined_doc, 'utf-8', 'ascii', sub='')
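
A minimal sketch of that conversion on a small vector, assuming the input really is UTF-8 (as it would be after readLines(..., encoding='UTF-8')). The variable `docs` is a hypothetical stand-in for `combined_doc`.

```r
# Hypothetical input standing in for combined_doc.
docs <- c("na\u00efve caf\u00e9", "plain ASCII")

# sub = "" drops everything that has no ASCII representation.
dropped <- iconv(docs, from = "UTF-8", to = "ASCII", sub = "")
print(dropped)
# [1] "nave caf"    "plain ASCII"

# sub = "?" leaves a visible marker instead, which can help when
# checking how much text was affected before discarding it.
marked <- iconv(docs, from = "UTF-8", to = "ASCII", sub = "?")
print(marked)
```

Unlike the printable-range pattern, this conversion leaves ASCII control characters such as tabs and newlines in place, because they are valid ASCII.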
    