R text file and text mining…how to load data

星月不相逢 2020-12-13 10:51

I am using the R package tm and I want to do some text mining. This is one document and is treated as a bag of words.

I don't understand the documenta

6 answers

青春惊慌失措 (OP)

2020-12-13 11:41

The following assumes you have a directory of text files from which you want to create a bag of words.

The only change you need to make is to replace path = "C:\\windows\\path\\to\\text\\files\\" with your own directory path.

library(tidyverse)
library(tidytext)

# create a character vector listing all files to be analyzed
all_txts <- list.files(path = "C:\\windows\\path\\to\\text\\files\\",   # path can be relative or absolute
                       pattern = "\\.txt$",  # regex: only select files ending in .txt (the dot is escaped so it matches literally)
                       full.names = TRUE)  # gives the file path as well as name

# create a data frame with one word per row
my_corpus <- map_dfr(all_txts, ~ tibble(txt = read_file(.x)) %>%   # read in each file in list
                      mutate(filename = basename(.x)) %>%   # add the file name as a new column
                      unnest_tokens(word, txt))   # split each word out as a separate row

# count the total # of rows/words in your corpus
my_corpus %>%
  summarize(number_rows = n())

# group and count by "filename" field and sort descending
my_corpus %>%
  group_by(filename) %>%
  summarize(number_rows = n()) %>%
  arrange(desc(number_rows))

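# remove stop words (stop_words is a data frame that ships with tidytext)
my_corpus2 <- my_corpus %>%
  anti_join(stop_words, by = "word")   # name the join column explicitly to avoid the "Joining with" message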

# repeat the count after stop words are removed
my_corpus2 %>%
  group_by(filename) %>%
  summarize(number_rows = n()) %>%
  arrange(desc(number_rows))
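
If you want the actual bag of words (how often each word occurs in each file) rather than just totals per file, something like the following might help. This is only a minimal sketch: the two in-memory "documents" and the names docs, tokens and bag are made up for illustration so the example runs without any files on disk. With real data you would start from my_corpus instead.

```r
library(dplyr)
library(tibble)
library(tidytext)

# hypothetical stand-in for the text files: two tiny in-memory documents
docs <- tibble(
  filename = c("a.txt", "b.txt"),
  txt = c("The cat sat on the mat", "The dog chased the cat")
)

# one row per word; unnest_tokens() lowercases and strips punctuation by default
tokens <- docs %>%
  unnest_tokens(word, txt)

# bag of words: count each remaining word within each file
bag <- tokens %>%
  anti_join(stop_words, by = "word") %>%   # drop common words such as "the"
  count(filename, word, sort = TRUE)       # adds a column n with the frequency

bag
```

count(filename, word) is shorthand for group_by(filename, word) followed by summarize(n = n()), so it fits the same pattern as the counts above.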
