Remove stop words from text in C#


忘掉有多难 2020-12-21 15:41

I want to remove an array of stop words from an input string, and I have the following procedure:

string[] arrToCheck = new string[] { "try ", "yourself", ...


      
      
        
6 Answers

太阳男子 (original poster) 2020-12-21 15:57

Here you go:

// Requires: using System; using System.Collections.Generic; using System.Linq;
var words_to_remove = new HashSet<string> { "try", "yourself", "before" };
string input = "Did you try this yourself before asking";

string output = string.Join(
    " ",
    input
        .Split(new[] { ' ', '\t', '\n', '\r' /* etc... */ })
        .Where(word => !words_to_remove.Contains(word))
);

Console.WriteLine(output);


This prints:

Did you this asking


The HashSet provides average constant-time lookups, so 450 elements in words_to_remove should be no problem at all. Also, we traverse the input string only once (instead of once per word to remove, as in your example).

However, if the input string is very long, there are ways to make this more memory efficient (if not quicker), by not holding the split result in memory all at once.
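One way to avoid holding the whole split result in memory is to read words lazily from a TextReader and write the kept ones straight to a TextWriter. The following is only a minimal sketch under that assumption; the class and method names (StreamingStopWords, ReadWords, Filter) are made up for illustration and are not part of the original answer.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class StreamingStopWords
{
    // Lazily yields whitespace-separated words, one at a time,
    // so the full list of words is never materialized.
    public static IEnumerable<string> ReadWords(TextReader reader)
    {
        var word = new StringBuilder();
        int c;
        while ((c = reader.Read()) != -1)
        {
            if (char.IsWhiteSpace((char)c))
            {
                if (word.Length > 0)
                {
                    yield return word.ToString();
                    word.Clear();
                }
            }
            else
            {
                word.Append((char)c);
            }
        }
        if (word.Length > 0)
            yield return word.ToString();
    }

    // Copies words from reader to writer, skipping stop words and
    // separating the remaining words with a single space.
    public static void Filter(TextReader reader, TextWriter writer, ISet<string> stopWords)
    {
        bool first = true;
        foreach (var w in ReadWords(reader))
        {
            if (stopWords.Contains(w)) continue;
            if (!first) writer.Write(' ');
            writer.Write(w);
            first = false;
        }
    }
}
```

With a StringReader over "Did you try this yourself before asking", a StringWriter, and the same three stop words, Filter writes "Did you this asking". For a large file, a StreamReader and StreamWriter could be passed instead, so only one word is buffered at a time.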

To remove not just "do" but "doing", "does" etc... you'll have to include all these variants in words_to_remove. If you wanted to remove prefixes in a general way, this would be possible to do (relatively) efficiently using a trie of words to remove (or alternatively a suffix tree of the input string), but what to do when "do" is not a prefix of something that should be removed, such as "did"? Or when it is a prefix of something that shouldn't be removed, such as "dog"?
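The ambiguity is easy to see with a naive prefix filter. This sketch deliberately uses a plain linear scan over the prefixes rather than a trie, and the names (PrefixRemoval, RemoveByPrefix) are hypothetical; it is only meant to show the behaviour, not to be efficient.

```csharp
using System;
using System.Linq;

public static class PrefixRemoval
{
    // Naive approach: drops every word that starts with any of the prefixes.
    public static string RemoveByPrefix(string input, string[] prefixes)
    {
        return string.Join(" ",
            input.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
                 .Where(word => !prefixes.Any(p => word.StartsWith(p, StringComparison.Ordinal))));
    }
}
```

Calling RemoveByPrefix("I did not do what the dog does", new[] { "do" }) returns "I did not what the": "did" survives although it is a form of "do", while "dog" is removed although it is unrelated.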

BTW, to remove words no matter their case, simply pass the appropriate case-insensitive comparer to the HashSet constructor, for example StringComparer.CurrentCultureIgnoreCase.
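As a small sketch of the case-insensitive variant (the wrapper class CaseInsensitiveStopWords and its Remove method are illustrative names only), the comparer is passed to the constructor and the collection initializer is used as before:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CaseInsensitiveStopWords
{
    public static string Remove(string input)
    {
        // The comparer makes Contains("TRY") match the stored "try".
        var wordsToRemove = new HashSet<string>(StringComparer.CurrentCultureIgnoreCase)
        {
            "try", "yourself", "before"
        };

        return string.Join(" ",
            input.Split(new[] { ' ', '\t', '\n', '\r' }, StringSplitOptions.RemoveEmptyEntries)
                 .Where(word => !wordsToRemove.Contains(word)));
    }
}
```

Here Remove("Did you TRY this Yourself BEFORE asking") gives "Did you this asking". If the result should not depend on the current culture, StringComparer.OrdinalIgnoreCase is another option.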

--- EDIT ---

Here is another alternative:

var words_to_remove = new[] { " ", "try", "yourself", "before" }; // Note the space!
string input = "Did you try this yourself before asking";

string output = string.Join(
    " ",
    input.Split(words_to_remove, StringSplitOptions.RemoveEmptyEntries)
);


I'm guessing it should be slower (unless string.Split uses a hashtable internally), but it is nice and tidy ;)
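One caveat with the Split-based alternative: the separators are matched as substrings, not as whole words, so a stop word that occurs inside a longer word is cut out of it as well. The sketch below wraps the same call in a helper (SplitPitfall and RemoveBySplit are hypothetical names) to make that visible.

```csharp
using System;

public static class SplitPitfall
{
    // Same technique as above: every separator string is removed
    // wherever it occurs, including inside longer words.
    public static string RemoveBySplit(string input, string[] separators)
    {
        return string.Join(" ",
            input.Split(separators, StringSplitOptions.RemoveEmptyEntries));
    }
}
```

RemoveBySplit("Did you try this country", new[] { " ", "try" }) returns "Did you this coun", because the "try" inside "country" is treated as a separator too. The HashSet version does not have this problem, since it compares whole words.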
    
             
                                                        
            
            
              
                