Splitting strings with regular expressions by punctuation, whitespace, etc. in Java


I have this text file that I read into a Java application and then count the words in it line by line. Right now I am splitting the lines into words by a

St


                      


      
      
        
4 Answers

        
                         				            
            
           
            
                              
                
              
              
                
                  终归单人心        
                
              
                            
                2020-12-01 12:43
              
            
            
                                                                       
You have one small mistake in your regex. Try this:

String[] Res = Text.split("[\\p{Punct}\\s]+");


In [\\p{Punct}\\s]+ the + has moved from inside the character class to outside it. Otherwise the + is only one more literal character to split on, and consecutive delimiter characters are not combined into a single split.

For this code

String Text = "But I know. For example, the word \"can\'t\" should";

String[] Res = Text.split("[\\p{Punct}\\s]+");
System.out.println(Res.length);
for (String s : Res) {
    System.out.println(s);
}

I get this result:

10
But
I
know
For
example
the
word
can
t
should


Which should meet your requirement.
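
To make the effect of the quantifier's position concrete, here is a minimal sketch that runs both patterns on the same short input. The class name QuantifierPlacement and its two helper methods are hypothetical names chosen for this illustration.

```java
import java.util.Arrays;

public class QuantifierPlacement {
    // "+" inside the brackets is just one more literal character in the class,
    // so every single delimiter character causes its own split.
    static String[] splitPlusInside(String text) {
        return text.split("[\\p{Punct}\\s+]");
    }

    // "+" after the brackets repeats the class, so a run of delimiters counts once.
    static String[] splitPlusOutside(String text) {
        return text.split("[\\p{Punct}\\s]+");
    }

    public static void main(String[] args) {
        String text = "know. For";
        // The "." and the space are split separately, leaving an empty string between them.
        System.out.println(Arrays.toString(splitPlusInside(text)));  // [know, , For]
        System.out.println(Arrays.toString(splitPlusOutside(text))); // [know, For]
    }
}
```

The empty string in the first result is what inflates the word count when the + sits inside the class.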

As an alternative you can use

String[] Res = Text.split("\\P{L}+");


\\P{L} matches any Unicode code point that does not have the property "Letter".
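
As a rough illustration of what the "Letter" property changes, the sketch below compares \\P{L}+ with \\W+ on a sample string containing accented letters and a digit. The class name LetterSplit, the helper names, and the sample text are assumptions made for this example; the accented characters are written as Unicode escapes so the source stays ASCII.

```java
import java.util.Arrays;

public class LetterSplit {
    // Splits on runs of anything that is not a Unicode letter (digits are delimiters too).
    static String[] byNonLetters(String text) {
        return text.split("\\P{L}+");
    }

    // Splits on runs of non-word characters; by default \w only covers [a-zA-Z_0-9].
    static String[] byNonWordChars(String text) {
        return text.split("\\W+");
    }

    public static void main(String[] args) {
        // "naive cafe, 2 cats" with an i-diaeresis and an e-acute
        String text = "na\u00efve caf\u00e9, 2 cats";
        // Three tokens: the two accented words stay whole and the digit is dropped.
        System.out.println(Arrays.toString(byNonLetters(text)));
        // Five tokens: the accented letters act as delimiters and "2" is kept.
        System.out.println(Arrays.toString(byNonWordChars(text)));
    }
}
```

Which behaviour is preferable depends on whether numbers should count as words in the text being processed.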
                                                                        
                                                        
            
            
              
                
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  暖寄归人        
                
              
                            
                2020-12-01 12:44
              
            
            
                                                                       
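Well, since you want to count can't as two words, try the word-boundary pattern

\\b\\w+\\b

Note that this pattern matches the words themselves, so passing it to split() would return the punctuation and whitespace between the words. Use it with Pattern and Matcher and count the matches instead.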


http://www.regular-expressions.info/wordboundaries.html
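
A word-boundary pattern matches the words rather than the gaps between them, so one way to use it for counting is Matcher.find() instead of split(). The following is a minimal sketch under that assumption; the class name WordBoundaryCount and the helper findWords are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordBoundaryCount {
    // One run of word characters ([a-zA-Z_0-9] by default) between word boundaries.
    private static final Pattern WORD = Pattern.compile("\\b\\w+\\b");

    // Collects every match of WORD in the line, in order of appearance.
    static List<String> findWords(String line) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD.matcher(line);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        List<String> words = findWords("the word \"can't\" should");
        // The apostrophe is not a word character, so "can't" yields "can" and "t".
        System.out.println(words.size() + " " + words); // 5 [the, word, can, t, should]
    }
}
```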
                                                                        
                                                        
            
            
              
                
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  感情败类        
                
              
                            
                2020-12-01 12:48
              
            
            
                                                                       
Try:

line.split("[\\.,\\s!;?:\"]+");
or         "[\\.,\\s!;?:\"']+"


This character class matches any one of these characters: . , ! ; ? : " or whitespace, plus ' in the second variant (note that it contains neither / nor \). The + causes several delimiter characters in a row to be treated as one.

That should give you mostly sufficient accuracy.
More precise regexes would need more information about the type of text you need to parse, because ' can be a word delimiter as well. Most punctuation that delimits words sits next to whitespace, so splitting on [\\s]+ would be a close approximation as well (but it gives the wrong count on short quotations like: She said:"no".).
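
The quotation case mentioned above can be checked with a small sketch that applies both patterns to the same sentence. The class name QuotationSplit and its helper names are hypothetical and exist only for this illustration.

```java
public class QuotationSplit {
    // Whitespace-only splitting: punctuation stays glued to the neighbouring words.
    static String[] byWhitespace(String line) {
        return line.split("[\\s]+");
    }

    // Splitting on the explicit list of punctuation characters plus whitespace.
    static String[] byPunctuationList(String line) {
        return line.split("[\\.,\\s!;?:\"']+");
    }

    public static void main(String[] args) {
        String line = "She said:\"no\".";
        // said:"no". has no whitespace inside, so it stays one token.
        System.out.println(byWhitespace(line).length);      // 2
        // The colon and quotes now separate "said" from "no".
        System.out.println(byPunctuationList(line).length); // 3
    }
}
```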
                                                                        
                                                        
            
            
              
                
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  名媛妹妹        
                
              
                            
                2020-12-01 12:57
              
            
            
                                                                       
There's a predefined non-word character class, \W; see the Pattern documentation.

String line = "Hello! this is a line. It can't be hard to split into \"words\", can it?";
String[] words = line.split("\\W+");
for (String word : words) System.out.println(word);


gives

Hello
this
is
a
line
It
can
t
be
hard
to
split
into
words
can
it
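
One caveat when using the array length as the word count: if a line starts with a delimiter (an opening quote, for instance), split() puts an empty string at the front of the result, and an empty line yields a single empty string. The sketch below filters those out; the class name LeadingDelimiter and the helper countWords are hypothetical, and the behaviour described assumes Java 8 or later.

```java
import java.util.Arrays;

public class LeadingDelimiter {
    // Counts only the non-empty tokens produced by splitting on non-word characters.
    static long countWords(String line) {
        return Arrays.stream(line.split("\\W+"))
                     .filter(word -> !word.isEmpty())
                     .count();
    }

    public static void main(String[] args) {
        String line = "\"Hi,\" he said";
        // The leading quote produces an empty first element: ["", "Hi", "he", "said"].
        System.out.println(line.split("\\W+").length); // 4
        System.out.println(countWords(line));          // 3
        System.out.println(countWords(""));            // 0
    }
}
```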

                                                                        
                                                        
            
            
              
                
                        
                    
                
          
          	          
                             
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         