I am learning Natural Language Processing using NLTK.
I came across some code using PunktSentenceTokenizer, and I cannot understand its actual purpose in the given code.
PunktSentenceTokenizer is a sentence boundary detection algorithm that must be trained before use [1]. NLTK ships with a pre-trained English model, which is what `nltk.sent_tokenize` uses under the hood.
If you initialize the tokenizer without any arguments, it starts with default (untrained) parameters, which are already enough to split straightforward text:
In [1]: import nltk
In [2]: tokenizer = nltk.tokenize.punkt.PunktSentenceTokenizer()
In [3]: txt = """ This is one sentence. This is another sentence."""
In [4]: tokenizer.tokenize(txt)
Out[4]: [' This is one sentence.', 'This is another sentence.']
You can also provide your own training data to train the tokenizer before using it. Punkt uses an unsupervised algorithm, so you train it on plain, unannotated text:
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
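For illustration, here is a minimal self-contained sketch of that one-liner; the training corpus below is made up, and any plain text works in its place:

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer

# Hypothetical training corpus: any plain, unannotated text works,
# since Punkt's algorithm is unsupervised.
train_text = (
    "Dr. Smith went to Washington. He arrived at 10 a.m. on Monday. "
    "The meeting with Mr. Jones lasted two hours."
)

# Passing training text to the constructor trains the tokenizer on it.
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)

sentences = custom_sent_tokenizer.tokenize("Mr. Brown left early. He said goodbye.")
print(sentences)
```

Note that with a corpus this small the tokenizer will not learn much (for instance, it may or may not recognize "Mr." as an abbreviation); in practice you would train on a large body of text from your target domain.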
For most cases, the pre-trained English model is perfectly fine, so you can simply use `nltk.sent_tokenize` instead of training your own tokenizer.
So what does all this have to do with POS tagging? NLTK's POS tagger works on tokenized sentences, so you need to break your text into sentences and then into word tokens before you can POS tag it.
See NLTK's documentation for more details.
[1] Kiss and Strunk, "Unsupervised Multilingual Sentence Boundary Detection"