Extract text using regex in R


一整个雨季 2021-01-25 02:15

I read a text file containing the data below and am trying to convert it to a data frame:

Id:   1
ASIN: 0827229534
  title: Patterns of Preaching: A Sermon Sampler
  group


      
      
        
5 answers

南笙 (original poster) 2021-01-25 02:47

Here is a different approach using separate_rows and spread to reformat the text file into a dataframe:

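library(dplyr)
library(tidyr)

# path_to_textfile is a placeholder for the location of the input file
text <- readLines(path_to_textfile)

data.frame(text = text) %>%
  # split lines holding several "key: value" pairs (the reviews line)
  separate_rows(text, sep = "(?<=\\d)\\s+(?=[a-z])") %>%
  # key = letters directly before a colon, value = everything after it
  extract(text, c("title", "value"), regex = "(?i)([a-z]+):(.+)") %>%
  filter(!title %in% c("reviews", "downloaded")) %>%
  # number each key's occurrences so spread() yields one row per record
  group_by(title) %>%
  mutate(id = row_number()) %>%
  spread(title, value) %>%
  select(-id)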
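
To see what the two regular expressions do, it may help to run them on a single line in base R. This is only a sketch: it uses strsplit() and regexec() as stand-ins for separate_rows() and extract(), and the variable names (line, parts, keys, values) are made up for illustration.

```r
# One line from the sample data that holds several "key: value" pairs
line <- "  reviews: total: 2  downloaded: 2  avg rating: 5"

# Split on whitespace that follows a digit and precedes a lowercase letter.
# perl = TRUE is needed for the lookbehind/lookahead syntax.
parts <- strsplit(line, "(?<=\\d)\\s+(?=[a-z])", perl = TRUE)[[1]]
print(parts)

# Group 1: the run of letters directly before a colon (case-insensitive).
# Group 2: everything after that colon.
m <- regmatches(parts, regexec("(?i)([a-z]+):(.+)", parts, perl = TRUE))
keys <- vapply(m, function(x) x[2], character(1))
values <- trimws(vapply(m, function(x) x[3], character(1)))

print(data.frame(key = keys, value = values))
```

Note that "avg rating: 5" yields the key "rating", not "avg rating", because [a-z]+ cannot match the space. That is why the result has a column named rating.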


Result:

         ASIN group   Id rating salesrank
1  0827229534  Book    1      5    396585
2    12412441  Book    2     10   4225352
                                                         similar
1  5  0804215715  156101074X  0687023955  0687074231  082721619X
2                                         1241242 1412414 124124
                                     title
1  Patterns of Preaching: A Sermon Sampler
2                                Patterns2
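
Because the (.+) group captures everything after the colon, every column comes back as character and keeps its leading spaces. A possible clean-up step is sketched below in base R. The data frame res is a hand-typed stand-in for part of the pipeline output, and the choice of which columns to treat as numeric is an assumption.

```r
# Hand-typed stand-in for part of the pipeline output (values as captured)
res <- data.frame(
  ASIN      = c(" 0827229534", " 12412441"),
  Id        = c("   1", "   2"),
  rating    = c(" 5", " 10"),
  salesrank = c(" 396585", " 4225352"),
  stringsAsFactors = FALSE
)

# Strip leading and trailing whitespace from every column
res[] <- lapply(res, trimws)

# Convert only the columns that are really numbers; ASIN stays character
# so that the leading zero in "0827229534" is not lost
num_cols <- c("Id", "rating", "salesrank")
res[num_cols] <- lapply(res[num_cols], as.integer)

str(res)
```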


Data:

Id:   1
ASIN: 0827229534
  title: Patterns of Preaching: A Sermon Sampler
  group: Book
  salesrank: 396585
  similar: 5  0804215715  156101074X  0687023955  0687074231  082721619X
  reviews: total: 2  downloaded: 2  avg rating: 5
Id:   2
ASIN: 12412441
  title: Patterns2
  group: Book
  salesrank: 4225352
  similar: 1241242 1412414 124124
  reviews: total: 2  downloaded: 2  avg rating: 10


Note:

End the text file with a trailing newline (an empty last line). Otherwise readLines() gives an "incomplete final line found" warning when reading the file, although it still returns the lines.
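
A quick way to check how readLines() treats a file that lacks a trailing newline is to write one to a temporary file. This is a small sketch; the names tmp, msg and lines are made up for illustration.

```r
tmp <- tempfile(fileext = ".txt")

# cat() writes exactly what it is given, so this file has no final newline
cat("Id:   1\nASIN: 0827229534", file = tmp)

# Capture the condition message instead of letting it print
msg <- tryCatch(readLines(tmp), warning = function(w) conditionMessage(w))
print(msg)

# warn = FALSE suppresses the message; both lines are still returned
lines <- readLines(tmp, warn = FALSE)
print(lines)

unlink(tmp)
```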
    
             
                                                        
            
            
              
                