Extract text using regex in R

一整个雨季 2021-01-25 02:15

I read a text file containing the data below and am trying to convert it to a data frame:

Id:   1
ASIN: 0827229534
  title: Patterns of Preaching: A Sermon Sampler
  group         


        
5 answers
  •  南笙 (original poster)
    2021-01-25 02:47

    Here is a different approach using separate_rows and spread to reformat the text file into a dataframe:

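    text <- readLines(path_to_textfile)

    library(dplyr)
    library(tidyr)

    data.frame(text = text) %>%
      # split lines that hold several fields (the "reviews" line) into one row per field
      separate_rows(text, sep = "(?<=\\d)\\s+(?=[a-z])") %>%
      # the letters right before a colon become the field name, the rest its value
      extract(text, c("title", "value"), regex = "(?i)([a-z]+):(.+)") %>%
      filter(!title %in% c("reviews", "downloaded")) %>%
      # number the occurrences of each field so spread() can line them up per record
      group_by(title) %>%
      mutate(id = 1:n()) %>%
      spread(title, value) %>%
      select(-id)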
    

    Result:

             ASIN group   Id rating salesrank
    1  0827229534  Book    1      5    396585
    2    12412441  Book    2     10   4225352
                                                             similar
    1  5  0804215715  156101074X  0687023955  0687074231  082721619X
    2                                         1241242 1412414 124124
                                         title
    1  Patterns of Preaching: A Sermon Sampler
    2                                Patterns2
    

    Data:

    Id:   1
    ASIN: 0827229534
      title: Patterns of Preaching: A Sermon Sampler
      group: Book
      salesrank: 396585
      similar: 5  0804215715  156101074X  0687023955  0687074231  082721619X
      reviews: total: 2  downloaded: 2  avg rating: 5
    Id:   2
    ASIN: 12412441
      title: Patterns2
      group: Book
      salesrank: 4225352
      similar: 1241242 1412414 124124
      reviews: total: 2  downloaded: 2  avg rating: 10
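For comparison, the same reshaping can be sketched in base R. This is only an illustration under a few assumptions: it uses a shortened copy of the data above (the `similar` and `reviews` lines are left out), it assumes every record starts with an `Id` line, and the variable names `key`, `value`, `record`, `long` and `wide` are made up for the example.

```r
# Shortened copy of the sample data, one field per line.
text <- c(
  "Id:   1",
  "ASIN: 0827229534",
  "  title: Patterns of Preaching: A Sermon Sampler",
  "  group: Book",
  "  salesrank: 396585",
  "Id:   2",
  "ASIN: 12412441",
  "  title: Patterns2",
  "  group: Book",
  "  salesrank: 4225352"
)

# Split each line at its first colon into a field name and a trimmed value.
key   <- trimws(sub(":.*$", "", text))
value <- trimws(sub("^[^:]*:", "", text))

# Assumption: a new record begins at every "Id" line, so a running count
# of those lines gives each row its record number.
record <- cumsum(key == "Id")

# Go from one row per field to one row per record.
long <- data.frame(record, key, value, stringsAsFactors = FALSE)
wide <- reshape(long, idvar = "record", timevar = "key", direction = "wide")
names(wide) <- sub("^value\\.", "", names(wide))

print(wide)
```

Splitting at the first colon keeps titles that contain a colon themselves, such as "Patterns of Preaching: A Sermon Sampler", in one piece.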
    

    Note:

    Make sure the text file ends with a newline (a blank last line). Otherwise readLines warns about an incomplete final line when reading the file.
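A minimal sketch of the end-of-file behaviour, using a temporary file made up for illustration: in base R, `readLines` reports a missing final newline as a warning and still returns every line, and its `warn` argument can silence that message.

```r
# Write two lines WITHOUT a trailing newline to a temporary file
# (the file and its contents exist only for this example).
tmp <- tempfile(fileext = ".txt")
cat("Id:   1\nASIN: 0827229534", file = tmp)

# warn = FALSE suppresses the "incomplete final line" warning;
# both lines are still returned.
lines <- readLines(tmp, warn = FALSE)
print(lines)

unlink(tmp)
```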
