How do I tokenize a string sentence in NLTK?

鱼传尺愫 2020-11-29 06:58

I am using nltk, and I want to create my own custom texts just like the default ones in nltk.book. However, I've only got as far as a method like

my_text = [ ...
2 Answers
  • 2020-11-29 07:30

    As @PavelAnossov answered, the canonical way is to use the word_tokenize function from nltk:

    from nltk import word_tokenize
    sent = "This is my text, this is a nice way to input text."
    word_tokenize(sent)
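
    Since the original goal was a custom text like the defaults in nltk.book, here is a minimal sketch (assuming that is what you're after) that wraps the token list in nltk.Text, which gives you the same interface as the built-in texts:

    from nltk import word_tokenize
    from nltk.text import Text

    sent = "This is my text, this is a nice way to input text."
    tokens = word_tokenize(sent)        # ['This', 'is', 'my', 'text', ',', ...]
    my_text = Text(tokens)              # behaves like the texts in nltk.book
    my_text.concordance("text")         # prints every occurrence of "text" in context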
    

    If your sentence really is simple enough, you can skip NLTK entirely: use the string.punctuation set to strip punctuation, then split on whitespace:

    import string

    x = "This is my text, this is a nice way to input text."
    # keep only non-punctuation characters, then split on single spaces
    y = "".join([i for i in x if i not in string.punctuation]).split(" ")
    print(y)
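
    Note that splitting on a single space leaves empty strings when the input has repeated spaces, and this approach also strips the apostrophe out of contractions such as "o'clock", which word_tokenize keeps as one token. Under the same "simple sentence" assumption, a slightly more robust sketch uses str.translate with a deletion table and a bare split():

    import string

    x = "This is my text, this is a nice way to input text."
    # delete every punctuation character, then split on any run of whitespace
    y = x.translate(str.maketrans("", "", string.punctuation)).split()
    print(y)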
    
  • 2020-11-29 07:43

    This is actually on the main page of nltk.org:

    >>> import nltk
    >>> sentence = """At eight o'clock on Thursday morning
    ... Arthur didn't feel very good."""
    >>> tokens = nltk.word_tokenize(sentence)
    >>> tokens
    ['At', 'eight', "o'clock", 'on', 'Thursday', 'morning',
    'Arthur', 'did', "n't", 'feel', 'very', 'good', '.']
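
    If word_tokenize raises a LookupError on a fresh install, the Punkt tokenizer models are probably missing. Assuming a standard NLTK setup (the exact resource name can vary between NLTK versions), downloading them once usually fixes it:

    >>> import nltk
    >>> nltk.download('punkt')   # fetch the Punkt models that word_tokenize relies on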
    