How to output NLTK chunks to file?

终归单人心 2021-01-19 13:24

I have a Python script that uses the NLTK library to parse, tokenize, tag, and chunk some text, say random text from the web.

I need to format the chunks and write them to a file.

2 answers
  •  死守一世寂寞
    2021-01-19 14:11

    Firstly, see this video: https://www.youtube.com/watch?v=0Ef9GudbxXY

    Now for the proper answer:

    import re
    import io
    
    from nltk import pos_tag, word_tokenize, sent_tokenize, RegexpParser
    
    
    xstring = u"An electronic library (also referred to as digital library or digital repository) is a focused collection of digital objects that can include text, visual material, audio material, video material, stored as electronic media formats (as opposed to print, micro form, or other media), along with means for organizing, storing, and retrieving the files and media contained in the library collection. Digital libraries can vary immensely in size and scope, and can be maintained by individuals, organizations, or affiliated with established physical library buildings or institutions, or with academic institutions.[1] The electronic content may be stored locally, or accessed remotely via computer networks. An electronic library is a type of information retrieval system."
    
    
    chunkGram1 = r"""Chunk: {<JJ\w?>*<NN>}"""  # optional adjectives (JJ, JJR, JJS) followed by a singular noun
    chunkParser1 = RegexpParser(chunkGram1)
    
    chunked = [chunkParser1.parse(pos_tag(word_tokenize(sent))) 
                for sent in sent_tokenize(xstring)]
    
    with io.open('outfile', 'w', encoding='utf8') as fout:
        for chunk in chunked:
            fout.write(str(chunk)+'\n\n')
    

    [out]:

    alvas@ubi:~$ python test2.py
    Traceback (most recent call last):
      File "test2.py", line 18, in <module>
        fout.write(str(chunk)+'\n\n')
    TypeError: must be unicode, not str
    alvas@ubi:~$ python3 test2.py
    alvas@ubi:~$ head outfile
    (S
      An/DT
      (Chunk electronic/JJ library/NN)
      (/:
      also/RB
      referred/VBD
      to/TO
      as/IN
      (Chunk digital/JJ library/NN)
      or/CC
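
The traceback above comes from the Python 2 run: a file opened with `io.open` in text mode accepts only text, and on Python 2 `str(chunk)` is a byte string. A small sketch of that rule, using only the standard library and a hypothetical scratch file (the path and the sample strings are placeholders, not part of the script above):

```python
import io
import os
import tempfile

# Hypothetical scratch location; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), 'outfile')

with io.open(path, 'w', encoding='utf8') as fout:
    fout.write(u'(S An/DT)\n')          # text is accepted
    try:
        fout.write(b'(S An/DT)\n')      # bytes are rejected in text mode
        bytes_rejected = False
    except TypeError:
        bytes_rejected = True

# Read the file back to see what was actually written.
with io.open(path, encoding='utf8') as fin:
    written = fin.read()

print(bytes_rejected)
print(written)
```

Only the text write reaches the file; the byte write raises `TypeError` before anything is written.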
    

    If you have to stick to Python 2.7:

    with io.open('outfile', 'w', encoding='utf8') as fout:
        for chunk in chunked:
            fout.write(unicode(chunk)+'\n\n')
    

    [out]:

    alvas@ubi:~$ python test2.py
    alvas@ubi:~$ head outfile
    (S
      An/DT
      (Chunk electronic/JJ library/NN)
      (/:
      also/RB
      referred/VBD
      to/TO
      as/IN
      (Chunk digital/JJ library/NN)
      or/CC
    alvas@ubi:~$ python3 test2.py
    Traceback (most recent call last):
      File "test2.py", line 18, in <module>
        fout.write(unicode(chunk)+'\n\n')
    NameError: name 'unicode' is not defined
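
The `NameError` appears because Python 3 has no `unicode` built-in; there, `str` is already the text type. One way to keep a single script working on both interpreters is to pick the text type at runtime. This is only a sketch: the name `text_type` is chosen here for illustration, and the sample strings and temporary path stand in for the real `chunked` list and output file.

```python
import io
import os
import tempfile

try:
    text_type = unicode  # Python 2: the text type is called unicode
except NameError:
    text_type = str      # Python 3: str is already text

# Placeholder data standing in for the parsed chunks.
chunks = ['(S An/DT)', '(Chunk electronic/JJ library/NN)']

# Hypothetical scratch location; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), 'outfile')

with io.open(path, 'w', encoding='utf8') as fout:
    for chunk in chunks:
        # Convert to text explicitly so the write works on either interpreter.
        fout.write(text_type(chunk) + u'\n\n')

with io.open(path, encoding='utf8') as fin:
    written = fin.read()

print(written)
```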
    

    And this is strongly recommended if you must support Python 2.7, since the same script then also runs under Python 3:

    from six import text_type
    with io.open('outfile', 'w', encoding='utf8') as fout:
        for chunk in chunked:
            fout.write(text_type(chunk)+'\n\n')
    

    [out]:

    alvas@ubi:~$ python test2.py
    alvas@ubi:~$ head outfile 
    (S
      An/DT
      (Chunk electronic/JJ library/NN)
      (/:
      also/RB
      referred/VBD
      to/TO
      as/IN
      (Chunk digital/JJ library/NN)
      or/CC
    alvas@ubi:~$ python3 test2.py
    alvas@ubi:~$ head outfile 
    (S
      An/DT
      (Chunk electronic/JJ library/NN)
      (/:
      also/RB
      referred/VBD
      to/TO
      as/IN
      (Chunk digital/JJ library/NN)
      or/CC
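
If the file should contain only the chunked phrases rather than the whole bracketed tree, the subtrees can be filtered by label before writing. The following is a rough sketch, assuming NLTK 3.x and its `Tree.subtrees()`, `Tree.label()`, and `Tree.leaves()` methods. The hand-built tree stands in for one element of `chunked` so that no tagger models are needed, and `chunk_phrases` is a hypothetical helper name.

```python
import io
import os
import tempfile

from nltk import Tree

# Hand-built stand-in for one parsed sentence; leaves are (word, tag) pairs,
# the same shape RegexpParser.parse() produces from pos_tag() output.
sentence = Tree('S', [
    ('An', 'DT'),
    Tree('Chunk', [('electronic', 'JJ'), ('library', 'NN')]),
    ('or', 'CC'),
    Tree('Chunk', [('digital', 'JJ'), ('library', 'NN')]),
])


def chunk_phrases(tree, label='Chunk'):
    """Return the words of every subtree carrying the given label."""
    return [' '.join(word for word, tag in subtree.leaves())
            for subtree in tree.subtrees(filter=lambda t: t.label() == label)]


phrases = chunk_phrases(sentence)

# Hypothetical scratch location; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), 'chunks.txt')
with io.open(path, 'w', encoding='utf8') as fout:
    for phrase in phrases:
        fout.write(phrase + u'\n')  # one chunk per line

print(phrases)
```

With the real data, the same helper could be applied to each tree in `chunked` inside the existing loop.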
    
