How can I tokenize a text column in R? unnest function not working

Submitted by 末鹿安然 on 2021-02-10 04:01:27

Question


I am a new R user and would really appreciate help with a tokenization problem.

My task in brief: I am trying to import a text file into R. One of the text columns is Headline. The dataset is a collection of news articles related to a disease.

Issue: I have tried many times to tokenize the Headline column using the unnest_tokens function.

It gives me the following error messages:

Error in UseMethod("unnest_tokens_") : no applicable method for 'unnest_tokens_' applied to an object of class "character"

Error in unnest_tokens(word, Headline) : object 'word' not found

library(dplyr)
library(tidytext)

DengueNews %>%
unnest_tokens(word, Headline)

Note: The dataset is available at https://drive.google.com/file/d/18VWg-2sO11GpwxMGF1UbziodoWK9B9Ru/view?usp=sharing. I am following the instructions from https://www.tidytextmining.com/tidytext.html


Answer 1:


It is not clear how the data was read. As mentioned in the comments, if the 'Headline' column is of class character, unnest_tokens should work. Here, we use read_excel from the readxl package to read the dataset. By default, text columns are returned as character vectors.

library(readxl)
library(dplyr)     # provides the %>% pipe used below
library(tidytext)
DengueNews <- read_excel("DengueNews.xlsx")
class(DengueNews$Headline)
#[1] "character"

DengueNews %>%
  unnest_tokens(word, Headline)
# A tibble: 217 x 4
   Serial Date  Newscontent                                                                                                                                             word      
    <dbl> <chr> <chr>                                                                                                                                                   <chr>     
 1    216 43727 "The unofficial death toll is reported to be over 157, so far\r\n\r\n\r\nA total of 491 dengue patients have been admitted to different hospitals acro… dghs      
 2    216 43727 "The unofficial death toll is reported to be over 157, so far\r\n\r\n\r\nA total of 491 dengue patients have been admitted to different hospitals acro… 491       
 3    216 43727 "The unofficial death toll is reported to be over 157, so far\r\n\r\n\r\nA total of 491 dengue patients have been admitted to different hospitals acro… more      
 4    216 43727 "The unofficial death toll is reported to be over 157, so far\r\n\r\n\r\nA total of 491 dengue patients have been admitted to different hospitals acro… hospitali…
 5    216 43727 "The unofficial death toll is reported to be over 157, so far\r\n\r\n\r\nA total of 491 dengue patients have been admitted to different hospitals acro… for       
 6    216 43727 "The unofficial death toll is reported to be over 157, so far\r\n\r\n\r\nA total of 491 dengue patients have been admitted to different hospitals acro… dengue    
 7    216 43727 "The unofficial death toll is reported to be over 157, so far\r\n\r\n\r\nA total of 491 dengue patients have been admitted to different hospitals acro… in        
 8    216 43727 "The unofficial death toll is reported to be over 157, so far\r\n\r\n\r\nA total of 491 dengue patients have been admitted to different hospitals acro… 24hrs     
 9    215 43725 "The unofficial death toll is reported to be over 157, so far\r\n\r\n\r\nA fifth-grader schoolgirl has died of dengue fever at Dhaka Medical College a… 1         
10    215 43725 "The unofficial death toll is reported to be over 157, so far\r\n\r\n\r\nA fifth-grader schoolgirl has died of dengue fever at Dhaka Medical College a… more      
# … with 207 more rows
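The first error in the question ("applied to an object of class "character"") suggests that the object passed to unnest_tokens was a plain character vector rather than a data frame. The following is a minimal sketch of one way to handle that case, not necessarily what happened with the asker's data: the `headlines` vector and the `id` column are made-up stand-ins, and the idea is simply to wrap the vector in a tibble before tokenizing.

```r
library(dplyr)
library(tidytext)

# Hypothetical stand-in for the real data: a bare character vector of headlines
headlines <- c("Dengue cases rise in Dhaka",
               "Hospitals admit more dengue patients")

# unnest_tokens() expects a data frame as its first argument,
# so wrap the vector in a tibble with a named text column
headline_df <- tibble(id = seq_along(headlines), Headline = headlines)

# One row per word; the Headline column is replaced by the new word column
tokens <- headline_df %>%
  unnest_tokens(word, Headline)

print(tokens)
```

Checking class(DengueNews) before calling unnest_tokens is a quick way to tell whether the import produced a data frame or something else.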

If we change the column to another class, such as factor, the call fails:

library(dplyr)
DengueNews %>%
   mutate(Headline = factor(Headline)) %>%
   unnest_tokens(word, Headline)
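If the column did end up as a factor (for example, through an import function that converts strings to factors), one possible fix is to convert it back to character before tokenizing. This is a sketch under that assumption; the `news` tibble below is invented for illustration and is not the asker's dataset.

```r
library(dplyr)
library(tidytext)

# Hypothetical data with the text column stored as a factor
news <- tibble(
  Serial   = 1:2,
  Headline = factor(c("Dengue cases rise", "More patients admitted"))
)

# Convert the factor back to character, then tokenize as usual
tokens <- news %>%
  mutate(Headline = as.character(Headline)) %>%
  unnest_tokens(word, Headline)

print(tokens)
```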


Source: https://stackoverflow.com/questions/58051557/how-can-i-tokenize-a-text-column-in-r-unnest-function-not-working
