Extracting all Nouns from a text file using nltk

前端未结

关注

 7  1861

清歌不尽 2020-12-08 08:35

Is there a more efficient way of doing this? My code reads a text file and extracts all Nouns.

import nltk

File = open(fileName) #open file
lines = File.rea


      
      
        
          7条回答        

        
                    
            
            
                         
                
              
              
                
                   轻奢々
                                             
                
                
                (楼主)
            
              
              
                2020-12-08 09:08
              

            
            
                        
You can achieve good results using nltk, Textblob, SpaCy or any of the many other libraries out there. These libraries will all do the job but with different degrees of efficiency.

import nltk
from textblob import TextBlob
import spacy
nlp = spacy.load('en')
nlp1 = spacy.load('en_core_web_lg')

txt = """Natural language processing (NLP) is a field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human (natural) languages."""


On my windows 10 2 cores, 4 processors, 8GB ram i5 hp  laptop, in jupyter notebook, I ran some comparisons and here are the results.

For TextBlob:

%%time
print([w for (w, pos) in TextBlob(txt).pos_tags if pos[0] == 'N'])


And the output is

>>> ['language', 'processing', 'NLP', 'field', 'computer', 'science', 'intelligence', 'linguistics', 'inter', 'actions', 'computers', 'languages']
    Wall time: 8.01 ms #average over 20 iterations


For nltk:

%%time
print([word for (word, pos) in nltk.pos_tag(nltk.word_tokenize(txt)) if pos[0] == 'N'])


And the output is

>>> ['language', 'processing', 'NLP', 'field', 'computer', 'science', 'intelligence', 'linguistics', 'inter', 'actions', 'computers', 'languages']
    Wall time: 7.09 ms #average over 20 iterations


For spacy:

%%time
print([ent.text for ent in nlp(txt) if ent.pos_ == 'NOUN'])


And the output is

>>> ['language', 'processing', 'field', 'computer', 'science', 'intelligence', 'linguistics', 'inter', 'actions', 'computers', 'languages']
    Wall time: 30.19 ms #average over 20 iterations


It seems nltk and TextBlob are reasonably faster and this is to be expected since store nothing else about the input text, txt. Spacy is way slower. One more thing. SpaCy missed the noun NLP while nltk and TextBlob got it. I would shot for nltk or TextBlob unless there is something else I wish to extract from the input txt.


 Check out a quick start into spacy here.

 Check out some basics about TextBlob here.
 Check out nltk HowTos here
    
             
                                                        
            
            
              
                
                0
              
                   
                
               讨论(0)
              
                                                  
              
              
                          
             
       
          
              
                                       
     查看其它7个回答


            
                         
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
                              			
        
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复