How to remove unicode <U+00A6> from string?

随声附和 submitted on 2019-11-27 09:44:42
Wiktor Stribiżew

I just want to remove unicode <U+00A6> which is at the beginning of string.

Then you do not need gsub; you can use sub with the "^\\s*<U\\+\\w+>\\s*" pattern:

q <- "<U+00A6>  1000-66329"
sub("^\\s*<U\\+\\w+>\\s*", "", q)
#[1] "1000-66329"

Pattern details:

  • ^ - start of string
  • \\s* - zero or more whitespace characters
  • <U\\+ - a literal char sequence <U+
  • \\w+ - 1 or more letters, digits or underscores
  • > - a literal >
  • \\s* - zero or more whitespace characters.
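
As a quick check of the pattern details above, the sketch below applies the same pattern to a small vector. Only the first element comes from the question; the other test strings and the names pattern, x and cleaned are made up for illustration.

```r
# Pattern from the answer: optional whitespace, a literal <U+....> escape,
# optional whitespace, all anchored at the start of the string.
pattern <- "^\\s*<U\\+\\w+>\\s*"

# Hypothetical inputs; only the first one comes from the question.
x <- c("<U+00A6>  1000-66329",  # escape at the start: removed
       "  <U+00A6>1000-66329",  # leading spaces before the escape: removed too
       "1000 <U+00A6> 66329",   # escape in the middle: left alone because of ^
       "1000-66329")            # no escape: unchanged

# sub() is vectorised and replaces at most one match per element.
cleaned <- sub(pattern, "", x)
print(cleaned)
```

Because of the ^ anchor, an escape that appears later in the string is not touched, which is what "at the beginning of string" asks for.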

If you also need to replace the - with a space, add a |- alternative and use gsub, since several replacements are now expected and the replacement must be a space (the same as in akrun's answer):

trimws(gsub("^\\s*<U\\+\\w+>|-", " ", q))

See the R online demo
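
To illustrate why gsub rather than sub is needed once the |- alternative is added, here is a minimal sketch. It reuses the string from the question; the names pattern, first_only and all_matches are illustrative only.

```r
q <- "<U+00A6>  1000-66329"
pattern <- "^\\s*<U\\+\\w+>|-"

# sub() replaces only the first match, so the hyphen survives.
first_only <- trimws(sub(pattern, " ", q))

# gsub() replaces every match: the leading escape and the hyphen.
all_matches <- trimws(gsub(pattern, " ", q))

print(first_only)
print(all_matches)
```

trimws is still needed in both cases because the leading escape is replaced by a space instead of being deleted.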

We can also do

trimws(gsub("\\S+\\s+|-", " ", q))
#[1] "1000 66329"

If the character is always the first one in the string, you can try:

substring("\U00A6  1000-66329", 2)

If R prints the string as <U+00A6> 1000-66329 instead of ¦ 1000-66329, then <U+00A6> is being treated as the literal text "<U+00A6>" rather than as the Unicode character. In that case you can do:

substring("<U+00A6>  1000-66329", 9)

Either way, the result is:

[1] "  1000-66329"
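
If it is unclear which of the two cases applies, one way to tell them apart is to count characters or look at the first code point. The sketch below is only an illustration: the names real_char, literal and strip_prefix are hypothetical, and the helper assumes the string always starts with either the real character or the literal text "<U+00A6>".

```r
# Two inputs that can look alike when printed in a non-UTF-8 locale.
real_char <- "\u00A6  1000-66329"    # starts with the actual broken-bar character
literal   <- "<U+00A6>  1000-66329"  # starts with eight ASCII characters

# nchar() counts characters: 1 + 2 + 10 versus 8 + 2 + 10.
print(nchar(real_char))
print(nchar(literal))

# utf8ToInt() returns code points: 166 is hex 00A6, 60 is "<".
print(utf8ToInt(real_char)[1])
print(utf8ToInt(literal)[1])

# Hypothetical helper: choose the start index to match the case,
# then trim the leftover spaces.
strip_prefix <- function(s) {
  start <- if (startsWith(s, "<U+00A6>")) 9 else 2
  trimws(substring(s, start))
}

print(strip_prefix(real_char))
print(strip_prefix(literal))
```

Both calls to strip_prefix return "1000-66329", so downstream code does not need to know which form the input had.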

Instead of removing it, you should convert it to the appropriate format ... You have to set your locale to UTF-8 like so:

Sys.setlocale("LC_CTYPE", "en_US.UTF-8")

You may see the following warning:

Warning message:
In Sys.setlocale("LC_CTYPE", "en_US.UTF-8") :
  OS reports request to set locale to "en_US.UTF-8" cannot be honored

In this case you should use stringi::stri_trans_general(x, "zh")

Here "zh" means Chinese. You need to know which language you have to convert to. That's it.
