Word substitution within tidy text format

江枫思渺然 submitted on 2021-02-07 20:22:08

Question


Hi, I'm working with a tidy text format and I am trying to replace the strings "emails" and "emailing" with "email".

library(dplyr)     # %>%, count(), filter(), mutate()
library(tidytext)  # unnest_tokens()
library(ggplot2)   # ggplot(), geom_col()

set.seed(123)
terms <- c("emails are nice", "emailing is fun", "computer freaks", "broken modem")
df <- data.frame(sentence = sample(terms, 100, replace = TRUE))
df
str(df)
df$sentence <- as.character(df$sentence)
tidy_df <- df %>%
  unnest_tokens(word, sentence)

tidy_df %>%
  count(word, sort = TRUE) %>%
  filter(n > 20) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip()

This works fine, but when I use:

 tidy_df <- gsub("emailing", "email", tidy_df)

to substitute words and then run the bar chart again, I get the following error message:

Error in UseMethod("group_by_") : no applicable method for 'group_by_' applied to an object of class "character"

Does anyone know how to easily substitute words within a tidy text format without changing the structure/class of tidy_df?


Answer 1:


Removing the ends of words like that is called stemming, and there are a couple of packages in R that will do it for you, if you'd like. One is the hunspell package from rOpenSci; another is the SnowballC package, which implements Porter algorithm stemming. You would implement that like so:

library(dplyr)
library(tidytext)
library(SnowballC)

terms <- c("emails are nice", "emailing is fun", "computer freaks", "broken modem")

set.seed(123)
tibble(txt = sample(terms, 100, replace = TRUE)) %>%  # tibble() replaces the deprecated data_frame()
        unnest_tokens(word, txt) %>%
        mutate(word = wordStem(word))
#> # A tibble: 253 × 1
#>      word
#>     <chr>
#> 1   email
#> 2       i
#> 3     fun
#> 4  broken
#> 5   modem
#> 6   email
#> 7       i
#> 8     fun
#> 9  broken
#> 10  modem
#> # ... with 243 more rows

Notice that it is stemming all your text and that some of the words don't look like real words anymore; you may or may not care about that.
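To see what the stemmer does to individual tokens before applying it to a whole data frame, it can help to call wordStem() on a small character vector. This is only a minimal sketch, assuming the SnowballC package is installed; the sample tokens are made up for illustration and mirror the words in the question.

```r
library(SnowballC)

# Hypothetical sample tokens, chosen to mirror the words in the question
tokens <- c("emails", "emailing", "is", "fun")

# wordStem() is vectorised: it returns one stem per input token
stems <- wordStem(tokens)

# Pair each token with its stem for side-by-side inspection
data.frame(token = tokens, stem = stems)
```

As in the output above, "is" comes back as "i", which is the kind of non-word result you may or may not want in a plot.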

If you don't want to stem all your text using a stemmer like SnowballC or hunspell, you can use dplyr's if_else within mutate() to replace just specific words.

set.seed(123)
tibble(txt = sample(terms, 100, replace = TRUE)) %>%  # tibble() replaces the deprecated data_frame()
        unnest_tokens(word, txt) %>%
        mutate(word = if_else(word %in% c("emailing", "emails"), "email", word))
#> # A tibble: 253 × 1
#>      word
#>     <chr>
#> 1   email
#> 2      is
#> 3     fun
#> 4  broken
#> 5   modem
#> 6   email
#> 7      is
#> 8     fun
#> 9  broken
#> 10  modem
#> # ... with 243 more rows
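If the list of words to replace grows, one possible variation on the same idea is to keep the replacements in a named character vector and look words up in it. The sketch below uses only base R; the names lookup, words, and hit are hypothetical and not part of the original answer.

```r
# Hypothetical lookup table: names are the words to replace,
# values are what they should become
lookup <- c(emails = "email", emailing = "email")

# A small stand-in for the word column of the tidy data frame
words <- c("emails", "is", "emailing", "fun")

# Replace only the words that appear in the lookup; leave the rest untouched
hit <- words %in% names(lookup)
words[hit] <- unname(lookup[words[hit]])
words
```

The same membership test (word %in% names(lookup)) could serve as the condition of if_else() inside mutate().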

Or it might make more sense for you to use str_replace from the stringr package.

library(stringr)
set.seed(123)
tibble(txt = sample(terms, 100, replace = TRUE)) %>%  # tibble() replaces the deprecated data_frame()
        unnest_tokens(word, txt) %>%
        mutate(word = str_replace(word, "email(s|ing)", "email"))
#> # A tibble: 253 × 1
#>      word
#>     <chr>
#> 1   email
#> 2      is
#> 3     fun
#> 4  broken
#> 5   modem
#> 6   email
#> 7      is
#> 8     fun
#> 9  broken
#> 10  modem
#> # ... with 243 more rows
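As for the error in the question: gsub() coerces its input with as.character(), so passing the whole data frame returns a plain character vector and the later count() call no longer has a data frame to work on. A minimal base R sketch of the difference, assuming a small stand-in for tidy_df; the extra token "emailserver" is invented to show why anchoring the pattern with ^ and $ may be worth considering.

```r
# A small stand-in for the tidy data frame from the question
tidy_df <- data.frame(word = c("emails", "is", "emailing", "emailserver"),
                      stringsAsFactors = FALSE)

# gsub() on the whole data frame coerces it with as.character(),
# so the result is a plain character vector, not a data frame
broken <- gsub("emailing", "email", tidy_df)
class(broken)

# Applying gsub() to the column keeps the data frame intact;
# the ^ and $ anchors restrict the match to whole tokens
tidy_df$word <- gsub("^email(s|ing)$", "email", tidy_df$word)
class(tidy_df)
tidy_df$word
```

Without the anchors, a pattern such as "email(s|ing)" would also rewrite the start of a longer token like "emailserver".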


Source: https://stackoverflow.com/questions/43344108/word-substitution-within-tidy-text-format
