How can I process Chinese/Japanese characters with R? [closed]

Posted by 魔方 西西 on 2019-12-08 12:56:28

EDIT:

It looks like R has a hard time reading non-English characters in as text. I tried scraping the Chinese alphabet table from the web and got a result that may help, provided the character encoding is the same in that table and in your own data.

### Load the package used to parse the HTML contents of a web page
library(XML)
### Open an internet connection (named con so it does not mask base::url)
con <- url('http://www.chinese-tools.com/characters/alphabet.html')
### Read in the content line by line, marking it as UTF-8
page <- readLines(con, encoding = "UTF-8")
close(con)
### Parse the HTML code; the lines are collapsed and passed as text, not as a file name
page <- htmlParse(paste(page, collapse = "\n"), asText = TRUE, encoding = "UTF-8")
### Create a list of tables
page <- readHTMLTable(page)
### The alphabet is contained in the third table of the page
alphabet <- as.data.frame(page[[3]])

You now have a data frame of Latin (US) alphabet letters, with another column showing how the corresponding Chinese characters were read into R. If the characters in the object you want to text-mine were read in the same way, it may be possible to use regular expressions to search for these encoded characters one at a time.
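
As a rough illustration of that regular-expression idea, here is a minimal sketch in base R. The sample vector `txt` is made up for the example (it is not the scraped table), and the `\p{Han}` pattern assumes your R build's PCRE library has Unicode property support, which is the usual case but worth checking with `pcre_config()`.

```r
### Hypothetical sample text; \u escapes keep this script ASCII-only
txt <- c("R \u4e2d\u6587 text", "plain ASCII", "\u65e5\u672c\u8a9e")

### Check that every string is valid UTF-8 before searching
all_valid <- all(validUTF8(txt))

### Search for one specific character (U+4E2D) as a fixed string
has_zhong <- grepl("\u4e2d", txt, fixed = TRUE)

### Match any Han character through the Unicode script property (needs perl = TRUE)
has_han <- grepl("\\p{Han}", txt, perl = TRUE)

### Pull out the individual Han characters from each string
han_chars <- regmatches(txt, gregexpr("\\p{Han}", txt, perl = TRUE))
```

If the scraped column and your own text do not share an encoding, converting both with `enc2utf8()` or `iconv()` before matching is probably the first thing to try.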
