Regex - Matching text AFTER certain characters


I want to scrape data from some text and dump it into an array. Consider the following text as example data:

| Example Data
| Title: This is a sample title
|


                      
4 answers

        
                         				            
            
           
            
                              
                
              
              
                
我在风中等你, 2020-12-19 01:36
              
            
            
                                                                       
In Ruby, as in PCRE and Boost, you may make use of the \K match reset operator:

\K keeps the text matched so far out of the overall regex match. h\Kd matches only the second d in adhd.

So, you may use
/:[[:blank:]]*\K.+/     # To only match horizontal whitespaces with `[[:blank:]]`
/:\s*\K.+/              # To match any whitespace with `\s`

See the Rubular demo #1 and the Rubular demo #2.

Details:

: - a colon
[[:blank:]]* - 0 or more horizontal whitespace chars
\K - match reset operator that discards the text matched so far from the overall match
.+ - matches and consumes 1 or more chars other than line break chars (use the /m modifier to also match line break chars).
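
As a minimal sketch of the \K approach in Ruby, the snippet below pulls the value out of one line and then collects every value from a multi-line string into an array with String#scan. Only the Title line comes from the question; the Author line and the variable names are assumptions made for illustration.

```ruby
# Sample input; the Author line is a made-up addition for illustration.
text = "Title: This is a sample title\nAuthor: Jane Doe"

# String#[] with a regex returns the overall match, which \K trims
# down to the text that follows the colon and any blanks.
first_value = text[/:[[:blank:]]*\K.+/]

# String#scan returns every match; with no capture groups it yields
# an array of strings, one per line that contains a colon.
all_values = text.scan(/:[[:blank:]]*\K.+/)

puts first_value
p all_values
```

Because .+ stops at a line break by default, each array element holds the value of exactly one line.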

                                                                        
                                                        
            
            
              
                
渐次进展, 2020-12-19 01:39
              
            
            
                                                                       
You could change it to:

/: (.+)/


and grab the contents of group 1. A lookbehind works too, though, and does just what you're asking:

/(?<=: ).+/
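
A short Ruby sketch of both variants, assuming the sample line from the question and hypothetical variable names. Both patterns expect exactly one space after the colon, as written in the answer.

```ruby
line = "Title: This is a sample title"

# Capture-group form: the second argument to String#[] selects group 1.
from_group = line[/: (.+)/, 1]

# Lookbehind form: the overall match is already the wanted text.
from_lookbehind = line[/(?<=: ).+/]

puts from_group
puts from_lookbehind
```

The two calls return the same string here; the lookbehind form simply avoids having to pick a group out of the match.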

                                                                        
                                                        
            
            
              
                
滥情空心, 2020-12-19 01:41
              
            
            
                                                                       
In addition to @minitech's answer, you can also make a 3rd variation:

/(?<=: ?)(.+)/


The difference here is that the group is captured after a look-behind.

If you still prefer the look-ahead concept to the look-behind one:

/(?=: ?(.+))/


This wraps your existing regex in a look-ahead and captures the text in a group.

And yes, the outer parentheses in your code will create a match. Compare that with the last example above, where the whole look-ahead is grouped, rather than needlessly using /( ... )/ without /(?= ... )/, since in most regular expression engines the first result is already the entire matched string.
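
A rough Ruby sketch of the look-ahead variant, assuming the sample line from the question and hypothetical variable names. The look-behind variant is left out because a variable-width look-behind such as (?<=: ?) is not accepted by every regex engine. A look-ahead consumes no characters, so the overall match is empty and the wanted text sits in group 1.

```ruby
line = "Title: This is a sample title"

# The look-ahead asserts what follows the current position without
# consuming it, so the match itself is zero-width.
m = line.match(/(?=: ?(.+))/)

whole_match = m[0]  # zero-width match at the colon
captured    = m[1]  # text captured inside the look-ahead

p whole_match
puts captured
```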
                                                                        
                                                        
            
            
              
                
执念已碎, 2020-12-19 01:42
              
            
            
                                                                       
I know you are asking for regex but I just saw the regex solution and found that it is rather hard to read for those unfamiliar with regex.

I'm also using Ruby and I decided to do it with:

line_as_string.split(": ")[-1]


This does what you require and IMHO it's far more readable.
For a very long string it might be inefficient, but that is not a concern for this purpose.
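
A small Ruby sketch of the split approach, with one caveat worth checking: if the value itself contains ": ", taking the last element keeps only the final piece. Passing a limit of 2 to String#split is one way around that. The second sample string and the variable names are assumptions made for illustration.

```ruby
line = "Title: This is a sample title"
value = line.split(": ")[-1]

# Hypothetical line whose value contains the separator again.
tricky = "Note: see chapter 2: Regex"

# Without a limit, only the text after the last ": " survives.
last_piece = tricky.split(": ")[-1]

# With a limit of 2, the string is cut at the first ": " only.
rest = tricky.split(": ", 2)[-1]

puts value
puts last_piece
puts rest
```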
                                                                        
                                                        
            
            
              
                