Keeping Turkish characters with the text mining package for R

Submitted by 馋奶兔 on 2019-12-11 05:01:01

Question


Let me start by saying that I'm still pretty much a beginner with R. I am currently trying out basic text mining techniques on Turkish texts using the tm package, but I have run into a problem with how R displays Turkish characters.

Here's what I did:

docs <- VCorpus(DirSource("DIRECTORY", encoding = "UTF-8"), readerControl = list(language = "tur"))
writeLines(as.character(docs), con="documents.txt")

My thinking was that setting the language to Turkish and the encoding to UTF-8 (the original encoding of the text files) should make it possible to display the Turkish characters İ, ı, ğ, Ğ, ş and Ş. Instead, the output converts these characters to I, i, g, G, s and S respectively and is saved in an ANSI encoding, which cannot represent them.

writeLines(as.character(docs), con="documents.txt", Encoding("UTF-8"))

also saves the file without the characters in ANSI encoding.

This does not seem to be an issue with the output file only.

writeLines(as.character(docs[[1]]))

for example, yields a line that should read "Okul ve cami açılışları umutları artırdı" but instead reads "Okul ve cami açilislari umutlari artirdi".

After reading the question "UTF-8 file output in R", I also tried the following code:

writeLines(as.character(docs), con="documents.txt", Encoding("UTF-8"), useBytes=T)

which didn't change the results.

All of this is on Windows 7 with both the most recent version of R and RStudio.

Is there a way to fix this? I am probably missing something obvious, but any help would be appreciated.


Answer 1:


Here is how I keep the Turkish characters intact:

  1. Open a new .Rmd file in RStudio (RStudio -> File -> New File -> R Markdown).
  2. Copy and paste your text containing Turkish characters into it.
  3. Save the .Rmd file with an explicit encoding (RStudio -> File -> Save with Encoding... -> UTF-8).
  4. yourdocument <- readLines("yourdocument.Rmd", encoding = "UTF-8")
  5. yourdocument <- paste(yourdocument, collapse = " ")
  6. After this step you can create your corpus, e.g. starting from VectorSource() in the tm package.
  7. The Turkish characters will appear as they should.
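
The steps above can be put together in one short script. This is only a minimal sketch, not the answerer's exact code. It makes three assumptions: the temporary file paths are hypothetical stand-ins for your own .Rmd file; the sample sentence is the one from the question, typed with \u escapes so the script does not depend on the editor's encoding; and the final write uses enc2utf8() with useBytes = TRUE, which is borrowed from the general "write raw UTF-8 bytes" approach rather than from this answer.

```r
library(tm)

# Sample sentence from the question, written with \u escapes so the script
# itself does not depend on the encoding of the source file.
sample_text <- "Okul ve cami a\u00e7\u0131l\u0131\u015flar\u0131 umutlar\u0131 art\u0131rd\u0131"

# Hypothetical input file standing in for the UTF-8 .Rmd saved from RStudio.
in_path <- tempfile(fileext = ".Rmd")
writeLines(enc2utf8(sample_text), in_path, useBytes = TRUE)

# Steps 4 and 5: read the lines, marking them as UTF-8, then collapse them.
yourdocument <- readLines(in_path, encoding = "UTF-8")
yourdocument <- paste(yourdocument, collapse = " ")

# Build the corpus from the in-memory character vector via VectorSource().
corpus <- VCorpus(VectorSource(yourdocument))

# Assumption: write the document out as raw UTF-8 bytes so that no conversion
# to the native (ANSI) encoding happens on the way to disk.
out_path <- tempfile(fileext = ".txt")
writeLines(enc2utf8(as.character(corpus[[1]])), out_path, useBytes = TRUE)

# Read the file back to check whether the Turkish characters survived.
round_trip <- readLines(out_path, encoding = "UTF-8")
```

If round_trip still contains ı and ş, the text made it through the corpus and back to disk intact. If it does not, comparing Encoding(yourdocument) before and after each step should show where the characters are lost.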


Source: https://stackoverflow.com/questions/47944331/keeping-turkish-characters-with-the-text-mining-package-for-r
