R text file and text mining…how to load data

星月不相逢 · 2020-12-13 10:51

I am using the R package tm and I want to do some text mining. This is one document and is treated as a bag of words.

I don't understand the documentation…

6 Answers
  • 2020-12-13 11:19

    I actually found this quite tricky to begin with, so here's a more comprehensive explanation.

    First, you need to set up a source for your text documents. I found that the easiest way (especially if you plan on adding more documents) is to create a directory source that will read all of your files in.

    library(tm) # load the text mining package
    source <- DirSource("yourdirectoryname/") # input path for documents
    YourCorpus <- Corpus(source, readerControl = list(reader = readPlain)) # load in documents
    

    You can then apply the stemDocument function to your corpus; a sketch follows. HTH.
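
    A minimal sketch, assuming the YourCorpus object from above (tm's stemDocument relies on the SnowballC package being installed):

    YourCorpus <- tm_map(YourCorpus, stemDocument) # stem each document in the corpus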

  • 2020-12-13 11:25

    Here's my solution for a text file with one line per observation. The latest vignette on tm (Feb 2017) gives more detail.

    library(tm) # provides VCorpus and VectorSource
    # read the file with one observation per line; sep = "\n" keeps each line intact
    text <- read.delim(textFileName, header = FALSE, sep = "\n", stringsAsFactors = FALSE)
    colnames(text) <- c("MyCol")
    docs <- text$MyCol
    a <- VCorpus(VectorSource(docs)) # one document per line
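
    A quick sanity check, a sketch using the objects above, that each line became its own document:

    length(a) # number of documents should equal the number of lines in the file
    content(a[[1]]) # print the text of the first document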
    
  • 2020-12-13 11:36

    Can't you just use the readPlain function from the same package? Or you could just use the more common scan function.

    mydoc.txt <- scan("./mydoc.txt", what = "character") # reads the file as a vector of whitespace-separated words
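
    Since scan splits on whitespace, mydoc.txt ends up as a character vector with one word per element. If you then want a single-document tm corpus, a minimal sketch (mycorpus is just an illustrative name):

    library(tm)
    # collapse the word vector back into one string and wrap it as a one-document corpus
    mycorpus <- VCorpus(VectorSource(paste(mydoc.txt, collapse = " ")))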
    
  • 2020-12-13 11:37

    I believe what you want to do is read an individual file into a corpus and then have it treat the different rows in the text file as different observations.

    See if this gives you what you want:

    library(tm)
    text <- read.delim("this is a test for R load.txt", sep = "\t") # note the tab separator is "\t", not "/t"
    text_corpus <- Corpus(VectorSource(text[[1]]), readerControl = list(language = "en")) # one document per row
    

    This assumes that the file "this is a test for R load.txt" has only one column, which holds the text data.

    Here "text_corpus" is the object that you are looking for.
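
    A quick check, assuming the objects above:

    length(text_corpus) # should equal the number of rows read in
    inspect(text_corpus) # prints each document in the corpus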

    Hope this helps.

  • 2020-12-13 11:39

    Like @richiemorrisroe I found this poorly documented. Here's how I get my text in to use with the tm package and make the document term matrix:

    library(tm) # load text mining library
    setwd('F:/My Documents/My texts') # sets R's working directory to where my files are
    a <- Corpus(DirSource("F:/My Documents/My texts"), readerControl = list(language = "lat")) # specifies the exact folder where my text file(s) are for analysis with tm
    summary(a) # check what went in
    a <- tm_map(a, removeNumbers)
    a <- tm_map(a, removePunctuation)
    a <- tm_map(a, stripWhitespace)
    a <- tm_map(a, content_transformer(tolower)) # base functions such as tolower must be wrapped in content_transformer in current tm
    a <- tm_map(a, removeWords, stopwords("english")) # this stopword file is at C:\Users\[username]\Documents\R\win-library\2.13\tm\stopwords
    a <- tm_map(a, stemDocument, language = "english")
    adtm <- DocumentTermMatrix(a)
    adtm <- removeSparseTerms(adtm, 0.75) # drop terms missing from more than 75% of documents
    

    In this case you don't need to specify the exact file name. As long as it is the only file in the directory referred to in line 3 (the DirSource call), it will be used by the tm functions. I do it this way because I have not had any success specifying the file name in that line.
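
    To see what survived the preprocessing, tm's own helpers can be used; a quick sketch (the frequency threshold is arbitrary):

    inspect(adtm) # dimensions, sparsity, and a sample of the matrix
    findFreqTerms(adtm, lowfreq = 5) # terms occurring at least five times across the corpus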

    If anyone can suggest how to get text into the lda package I'd be most grateful. I haven't been able to work that out at all.

  • 2020-12-13 11:41

    The following assumes you have a directory of text files from which you want to create a bag of words.

    The only change that needs to be made is to replace path = "C:\\windows\\path\\to\\text\\files\\" with your directory path.

    library(tidyverse)
    library(tidytext)
    
    # create a data frame listing all files to be analyzed
    all_txts <- list.files(path = "C:\\windows\\path\\to\\text\\files\\",   # path can be relative or absolute
                           pattern = "\\.txt$",  # select only files ending in .txt (dot escaped to match literally)
                           full.names = TRUE)  # gives the file path as well as name
    
    # create a data frame with one word per line
    my_corpus <- map_dfr(all_txts, ~ tibble(txt = read_file(.x)) %>%   # read in each file in list
                          mutate(filename = basename(.x)) %>%   # add the file name as a new column
                          unnest_tokens(word, txt))   # split each word out as a separate row
    
    # count the total # of rows/words in your corpus
    my_corpus %>%
      summarize(number_rows = n())
    
    # group and count by "filename" field and sort descending
    my_corpus %>%
      group_by(filename) %>%
      summarize(number_rows = n()) %>%
      arrange(desc(number_rows))
    
    # remove stop words
    my_corpus2 <- my_corpus %>%
      anti_join(stop_words)
    
    # repeat the count after stop words are removed
    my_corpus2 %>%
      group_by(filename) %>%
      summarize(number_rows = n()) %>%
      arrange(desc(number_rows))
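
    If you want to get from this tidy one-word-per-row format back to a classic bag-of-words matrix, a sketch using tidytext's cast_dtm() (it returns a tm DocumentTermMatrix, so the tm package must be installed; my_dtm is just an illustrative name):

    # count words per file, then cast the counts into a document-term matrix
    my_dtm <- my_corpus2 %>%
      count(filename, word) %>%
      cast_dtm(document = filename, term = word, value = n)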
    