I am using the R package tm and I want to do some text mining. This is one document and is treated as a bag of words.
I don't understand the documenta
The following assumes you have a directory of text files from which you want to create a bag of words.
The only change that needs to be made is to replace
path = "C:\\windows\\path\\to\\text\\files\\"
with your directory path.
library(tidyverse)
library(tidytext)
# create a character vector listing all files to be analyzed
all_txts <- list.files(path = "C:\\windows\\path\\to\\text\\files\\", # path can be relative or absolute
                       pattern = "\\.txt$", # escaped dot, so only files ending with .txt are selected
                       full.names = TRUE) # gives the file path as well as name
# create a data frame with one word per row
my_corpus <- map_dfr(all_txts, ~ tibble(txt = read_file(.x)) %>% # read in each file in list
                       mutate(filename = basename(.x)) %>% # add the file name as a new column
                       unnest_tokens(word, txt)) # split each word out as a separate row
# count the total number of rows/words in your corpus
my_corpus %>%
  summarize(number_rows = n())
# group and count by "filename" field and sort descending
my_corpus %>%
  group_by(filename) %>%
  summarize(number_rows = n()) %>%
  arrange(desc(number_rows))
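# remove stop words (stop_words is the data set shipped with tidytext)
my_corpus2 <- my_corpus %>%
  anti_join(stop_words, by = "word") # name the join column explicitly to avoid the "Joining, by" message
# repeat the count after stop words are removed
my_corpus2 %>%
  group_by(filename) %>%
  summarize(number_rows = n()) %>%
  arrange(desc(number_rows))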
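If you want to try this without pointing it at your own directory first, here is a small self-contained sketch of the same steps. The temporary directory, the two sample files, and the names `tmp_dir`, `demo_txts`, `demo_corpus` and `demo_bow` are all made up for illustration. The last step, `count(filename, word)`, is my own addition: it produces one row per file and word with its frequency, which is one common way to represent a bag of words.

```r
library(tidyverse)
library(tidytext)

# hypothetical sample data: two tiny text files in a temporary directory
tmp_dir <- file.path(tempdir(), "bow_demo")
dir.create(tmp_dir, showWarnings = FALSE)
write_lines("The cat sat on the mat.", file.path(tmp_dir, "a.txt"))
write_lines("The dog chased the cat.", file.path(tmp_dir, "b.txt"))

# same file listing step as above, pointed at the sample directory
demo_txts <- list.files(path = tmp_dir, pattern = "\\.txt$", full.names = TRUE)

# one row per word, tagged with the file it came from
# (unnest_tokens lowercases words and drops punctuation by default)
demo_corpus <- map_dfr(demo_txts, ~ tibble(txt = read_file(.x)) %>%
                         mutate(filename = basename(.x)) %>%
                         unnest_tokens(word, txt))

# bag of words: drop stop words, then count each word within each file
demo_bow <- demo_corpus %>%
  anti_join(stop_words, by = "word") %>%
  count(filename, word, sort = TRUE)

demo_bow
```

The resulting `demo_bow` has the columns `filename`, `word` and `n`. Once it looks right on the sample files, the same `count(filename, word, sort = TRUE)` call should work on `my_corpus2`.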