I'm trying to clean up some text that was loaded into memory using readLines(..., encoding='UTF-8').
If I don't specify the encoding, I see all kind
If you want to use regular expressions, you can keep only those characters you want using a range of ASCII codes:
text = "The way I talk to my family......i would get my ass beat to DEATH....but they kno I cray cray & just leave it at that 😜ðŸ˜â˜º'"
gsub('[^\x20-\x7E]', '', text)
# [1] "The way I talk to my family......i would get my ass beat to DEATH....but they kno I cray cray & just leave it at that '"
The ASCII codes are listed in the table at asciitable.com.
The pattern removes every character outside the range x20 (SPACE) through x7E (~), which is the set of printable ASCII characters.
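One side effect worth knowing: tab (x09) and newline (x0A) sit below x20, so the pattern strips them too. Here is a minimal sketch of that behaviour, with a variant that keeps them. The sample string and the variable names are made up for illustration.

```r
# Hypothetical sample: a newline, a tab, and an accented letter (U+00E9).
sample_text <- "line one\nline\ttwo \u00e9"

# Same pattern as above: drops everything outside x20-x7E,
# which includes the newline and the tab.
strict <- gsub("[^\x20-\x7E]", "", sample_text)

# Variant that also keeps tab and newline by adding them to the class.
lenient <- gsub("[^\x20-\x7E\t\n]", "", sample_text)

print(strict)
print(lenient)
```

If the text spans multiple lines and you want to keep that layout, the second form is probably the one you want.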
The easiest way to get rid of these characters is to convert from utf-8 to ascii:
combined_doc <- iconv(combined_doc, 'utf-8', 'ascii', sub='')
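For anyone who wants to try this, here is a small self-contained sketch. The sample vector is hypothetical and stands in for whatever combined_doc holds. With sub = '', bytes that cannot be represented in ASCII should simply be dropped, though the exact behaviour of iconv can vary with the platform's iconv implementation.

```r
# Hypothetical stand-in for combined_doc: one plain string,
# one with an accented letter, one with an emoji.
combined_doc <- c("plain text", "caf\u00e9", "cray cray \U0001F61C")

# Convert to ASCII, replacing anything non-convertible with nothing.
cleaned <- iconv(combined_doc, from = "UTF-8", to = "ASCII", sub = "")
print(cleaned)

# sub = "byte" is a documented alternative that shows the dropped bytes
# as <xx> hex codes, which helps when checking what was removed.
inspected <- iconv(combined_doc, from = "UTF-8", to = "ASCII", sub = "byte")
print(inspected)
```

Unlike the regex approach, this works on the encoding level, so the input does need to be valid UTF-8 (or at least declared as such) for the result to be meaningful.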