Extract text using regex in R

一整个雨季 2021-01-25 02:15

I read a text file containing the data below and am trying to convert it to a data frame:

Id:   1
ASIN: 0827229534
  title: Patterns of Preaching: A Sermon Sampler
  group         


        
5 answers
  •  野性不改
    2021-01-25 02:45

    Using the tidyverse package:

    library(tidyverse)
    
    text <- list(readLines("https://raw.githubusercontent.com/pranavn91/PhD/master/Expt/sample.txt"))
    
    out <- tibble(text = text)
    
    out <- out %>%
      rowwise() %>%
      mutate(ids = str_extract(text,"Id: .+") %>% na.omit() %>% str_remove("Id: ") %>% str_c(collapse = ", "),
             ASIN = str_extract(text,"ASIN: .+") %>% na.omit() %>% str_remove("ASIN: ") %>% str_c(collapse = ", "),
             title = str_extract(text,"title: .+") %>% na.omit() %>% str_remove("title: ") %>% str_c(collapse = ", "),
             group = str_extract(text,"group: .+") %>% na.omit() %>% str_remove("group: ") %>% str_c(collapse = ", "),
             similar = str_extract(text,"similar: .+") %>% na.omit() %>% str_remove("similar: ") %>% str_c(collapse = ", "),
             rating = str_extract(text,"avg rating: .+") %>% na.omit() %>% str_remove("avg rating: ") %>% str_c(collapse = ", ")
             ) %>%
      ungroup()
    

    I put the text in a list because I assume you will want to build a data frame from more than one item. If so, just add a new list element for each readLines call.

    Notice that, because of rowwise(), mutate treats each element of the list column as its own object, which is equivalent to using text[[1]], text[[2]], and so on.

    If an item can occur more than once, you'll need to add %>% str_c(collapse = ", ") as I have done; otherwise you can remove it.
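The extract, drop-NA, strip-label, collapse chain can be tried on its own with a small sketch. The lines vector and the repeated "tag:" field are hypothetical stand-ins for one item's readLines() output, not part of the sample file, and intermediate variables replace the pipe for readability.

```r
library(stringr)

# Hypothetical lines for one item; "tag:" is an invented field that
# appears twice so the collapse step has something to join.
lines <- c("Id:   1",
           "ASIN: 0827229534",
           "  tag: a",
           "  tag: b")

matches <- str_extract(lines, "tag: .+")     # NA for lines without the field
matches <- as.character(na.omit(matches))    # keep only the matching lines
values  <- str_remove(matches, "tag: ")      # strip the field label
tags    <- str_c(values, collapse = ", ")    # join repeats into one string

tags
```

For a field that occurs only once, the last step simply returns that single value unchanged.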

    UPDATE based on new sample data:

    The new sample dataset creates some different challenges that weren't addressed in my original answer.

    First, the data is all in a single file and I had assumed it would be in multiple files. It is possible to either separate everything into a list of lists, or to separate everything into a vector of characters. I chose the second option.

    Because I chose the second option, I now have to update my code to extract data until a \r (carriage return) is reached. In an R string this is written \\r, because R itself treats the backslash as an escape character.
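As a minimal sketch of the "extract until \r" idea, assuming a record with Windows-style line endings (the record string here is a hypothetical two-field example, not the real file):

```r
library(stringr)

# Hypothetical record with \r\n line endings.
record <- "Id:   1\r\nASIN: 0827229534\r\n"

# "\\r" in R source becomes the regex \r, i.e. a carriage return.
# (?!\\r). consumes one character at a time as long as it is not a \r.
asin <- str_extract(record, "ASIN: ((?!\\r).)+")
asin <- str_remove(asin, "ASIN: ")

asin
```

The match stops at the end of the ASIN line, so the following fields are not swallowed.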

    Next, some of the fields are empty! I have to check whether the result is empty and, if it is, replace it with NA so that every column still gets exactly one value per row. I'm using %>% ifelse(length(.)==0,NA,.) to accomplish this.
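The empty-field case can be reproduced in isolation. This sketch assumes a hypothetical record that has no title line; it uses a plain if/else in place of the piped ifelse() and leaves out the collapse step, but the idea is the same: a zero-length result is turned into NA.

```r
library(stringr)

# Hypothetical record with no "title:" field.
record <- "Id:   3\r\nASIN: 0486287785\r\n"

title <- str_extract(record, "title: ((?!\\r).)+")  # NA, nothing matched
title <- as.character(na.omit(title))               # character(0)
title <- str_remove(title, "title: ")               # still character(0)

# Replace the zero-length result with a single NA.
title <- if (length(title) == 0) NA_character_ else title

title
```

Without the final check, mutate() would receive a zero-length value for that row instead of a single NA.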

    Note: if you add other fields such as categories: to this search, the code will capture only the first line of that field. It will need to be modified to capture more than one line.
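One possible way to handle such a multi-line field is sketched below. It rests on an assumption that is not confirmed by the sample shown here: that each category sits on its own line and starts with a "|" character. The record string is hypothetical.

```r
library(stringr)

# Hypothetical record; the layout of the category lines is an assumption.
record <- "Id:   1\r\n  categories: 2\r\n   |Books|Religion\r\n   |Books|Sermons\r\n"

# Grab every run that starts with "|" and continues to the end of its line.
cats <- str_extract_all(record, "\\|[^\\r\\n]+")[[1]]

# Join the lines into one value, as done for the other fields.
categories <- str_c(cats, collapse = "; ")

categories
```

Note that this pattern would also pick up a "|" that appears in any other field of the record, so it may need to be narrowed for real data.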

    library(tidyverse)
    
    # Read the whole file into a single long string.
    text <- read_file("https://raw.githubusercontent.com/pranavn91/PhD/master/Expt/sample.txt")
    
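    # Split the text into one string per record, each starting at "Id: ".
    # The negative lookahead (?!Id:). consumes characters only while the next record's "Id:" has not begun.
    # "." does not match line breaks, so the alternative [^(Id:)] (any character other than "(", "I", "d", ":" or ")") lets the match continue across them.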
    text <- str_extract_all(text,"Id: (((?!Id:).)|[^(Id:)])+") %>% 
      flatten()
    
    out <- tibble(text = text)
    
    out <- out %>%
      rowwise() %>%
      mutate(ids = str_extract(text,"Id: ((?!\\r).)+") %>% na.omit() %>% str_remove("Id: ") %>% str_c(collapse = ", ") %>% ifelse(length(.)==0,NA,.),
             ASIN = str_extract(text,"ASIN: ((?!\\r).)+") %>% na.omit() %>% str_remove("ASIN: ") %>% str_c(collapse = ", ") %>% ifelse(length(.)==0,NA,.),
             title = str_extract(text,"title: ((?!\\r).)+") %>% na.omit() %>% str_remove("title: ") %>% str_c(collapse = ", ") %>% ifelse(length(.)==0,NA,.),
             group = str_extract(text,"group: ((?!\\r).)+") %>% na.omit() %>% str_remove("group: ") %>% str_c(collapse = ", ") %>% ifelse(length(.)==0,NA,.),
             similar = str_extract(text,"similar: ((?!\\r).)+") %>% na.omit() %>% str_remove("similar: \\d") %>% str_c(collapse = ", ") %>% ifelse(length(.)==0,NA,.),
             rating = str_extract(text,"avg rating: ((?!\\r).)+") %>% na.omit() %>% str_remove("avg rating: ") %>% str_c(collapse = ", ") %>% ifelse(length(.)==0,NA,.)
      ) %>%
      ungroup()
    
