Extracting all Nouns from a text file using nltk

Asked by 清歌不尽, 2020-12-08 08:35

Is there a more efficient way of doing this? My code reads a text file and extracts all Nouns.

import nltk

File = open(fileName)                    # open file
lines = File.read()                      # read all text
sentences = nltk.sent_tokenize(lines)    # tokenize into sentences
nouns = []                               # empty list to hold all nouns

for sentence in sentences:
    for word, pos in nltk.pos_tag(nltk.word_tokenize(str(sentence))):
        if pos == 'NN' or pos == 'NNP' or pos == 'NNS' or pos == 'NNPS':
            nouns.append(word)

7 Answers
  • 2020-12-08 08:47

    I'm not an NLP expert, but I think you're pretty close already, and there likely isn't a way to do better than the nested loops you already have.

    Recent versions of NLTK have a built-in function, nltk.tag.pos_tag_sents, that does what you're doing by hand, and it returns a list of lists of tagged words too.
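
    A minimal sketch of what that looks like, assuming lines holds the file's text as in the question:

    import nltk

    sentences = [nltk.word_tokenize(s) for s in nltk.sent_tokenize(lines)]
    tagged_sents = nltk.tag.pos_tag_sents(sentences)  # tag every sentence in one call
    nouns = [word for sent in tagged_sents
                  for word, pos in sent if pos.startswith('NN')]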

  • 2020-12-08 08:50
    import nltk
    lines = 'lines is some string of words'
    tokenized = nltk.word_tokenize(lines)
    nouns = [word for (word, pos) in nltk.pos_tag(tokenized) if(pos[:2] == 'NN')]
    print(nouns)
    

    Just simplified a bit more: slicing the tag so that pos[:2] == 'NN' covers NN, NNS, NNP and NNPS in a single test.

  • 2020-12-08 08:52

    If you get the error "Resource punkt not found. Please use the NLTK Downloader to obtain the resource", just do:

    import nltk
    nltk.download('punkt')
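
    pos_tag depends on a tagger model in the same way; if you hit the analogous "Resource ... not found" error for the tagger, the same fix applies (the resource name below assumes a recent NLTK release):

    import nltk
    nltk.download('averaged_perceptron_tagger')  # model behind nltk.pos_tag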
    
  • 2020-12-08 08:56

    Your code has no redundancy: you read the file once and visit each sentence, and each tagged word, exactly once. No matter how you write your code (e.g., using comprehensions), you will only be hiding the nested loops, not skipping any processing.

    The only potential for improvement is in its space complexity: instead of reading the whole file at once, you could read it in increments. But since you need to process a whole sentence at a time, it's not as simple as reading and processing one line at a time, so I wouldn't bother unless your files are gigabytes long; for short files it's not going to make any difference.

    In short, your loops are fine. There is a thing or two in your code that you could clean up (e.g., the if clause that matches the POS tags, as sketched below), but it's not going to change anything efficiency-wise.
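
    For example, the chained comparisons in the question's if clause can be collapsed without changing behavior (a tidy-up, not a speed-up):

    # instead of: pos == 'NN' or pos == 'NNP' or pos == 'NNS' or pos == 'NNPS'
    if pos in ('NN', 'NNP', 'NNS', 'NNPS'):  # or simply: pos.startswith('NN')
        nouns.append(word)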

  • 2020-12-08 09:06

    If you are open to options other than NLTK, check out TextBlob. It extracts all nouns and noun phrases easily:

    >>> from textblob import TextBlob
    >>> txt = """Natural language processing (NLP) is a field of computer science, artificial intelligence, and computational linguistics concerned with the inter
    actions between computers and human (natural) languages."""
    >>> blob = TextBlob(txt)
    >>> print(blob.noun_phrases)
    [u'natural language processing', 'nlp', u'computer science', u'artificial intelligence', u'computational linguistics']
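
    If you want individual nouns rather than phrases, the same blob also exposes per-word POS tags (a sketch, reusing blob from above):

    >>> nouns = [word for (word, pos) in blob.tags if pos.startswith('NN')]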
    
  • 2020-12-08 09:08

    You can achieve good results using NLTK, TextBlob, spaCy or any of the many other libraries out there. These libraries will all do the job, but with different degrees of efficiency.

    import nltk
    from textblob import TextBlob
    import spacy
    nlp = spacy.load('en')               # small English model ('en' shortcut; spaCy 3.x uses 'en_core_web_sm')
    nlp1 = spacy.load('en_core_web_lg')  # large English model (loaded but not used below)

    txt = """Natural language processing (NLP) is a field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human (natural) languages."""
    

    On my Windows 10 HP laptop (i5, 2 cores / 4 logical processors, 8 GB RAM), in a Jupyter notebook, I ran some comparisons, and here are the results.

    For TextBlob:

    %%time
    print([w for (w, pos) in TextBlob(txt).pos_tags if pos[0] == 'N'])
    

    And the output is

    >>> ['language', 'processing', 'NLP', 'field', 'computer', 'science', 'intelligence', 'linguistics', 'inter', 'actions', 'computers', 'languages']
        Wall time: 8.01 ms #average over 20 iterations
    

    For nltk:

    %%time
    print([word for (word, pos) in nltk.pos_tag(nltk.word_tokenize(txt)) if pos[0] == 'N'])
    

    And the output is

    >>> ['language', 'processing', 'NLP', 'field', 'computer', 'science', 'intelligence', 'linguistics', 'inter', 'actions', 'computers', 'languages']
        Wall time: 7.09 ms #average over 20 iterations
    

    For spacy:

    %%time
    print([token.text for token in nlp(txt) if token.pos_ == 'NOUN'])
    

    And the output is

    >>> ['language', 'processing', 'field', 'computer', 'science', 'intelligence', 'linguistics', 'inter', 'actions', 'computers', 'languages']
        Wall time: 30.19 ms #average over 20 iterations
    

    It seems NLTK and TextBlob are reasonably faster, and this is to be expected since they store nothing else about the input text, txt. spaCy is way slower. One more thing: spaCy missed the noun NLP, while NLTK and TextBlob caught it. I would shoot for NLTK or TextBlob unless there is something else I want to extract from the input txt.
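
    For instance, if you do want more than single nouns, the spaCy doc built above exposes base noun phrases directly (a sketch, reusing the nlp pipeline loaded earlier):

    doc = nlp(txt)
    print([chunk.text for chunk in doc.noun_chunks])  # noun chunks, e.g. 'Natural language processing'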


    Check out a quick start into spaCy here.
    Check out some basics about TextBlob here.
    Check out the NLTK HOWTOs here.
