Creating a custom categorized corpus in NLTK and Python

天命终不由人 2020-12-01 07:15

I'm having a problem with regular expressions and CategorizedPlaintextCorpusReader in Python.

I want to create a custom

1 answer
  • 2020-12-01 07:53

    Here is the answer to my own question. I was considering two layouts, so I'll cover both in case someone needs this later. If your data is organized like the movie_reviews corpus, with one subfolder per label (each named the way you want the category to be called) containing the training files, you can use this:

    import os
    from nltk.corpus.reader import CategorizedPlaintextCorpusReader
    reader = CategorizedPlaintextCorpusReader(os.path.expanduser('~/MainFolder/'), r'.*\.txt', cat_pattern=r'(\w+)/*')
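
    To check that the folder layout is picked up the way you expect, a minimal sketch like the one below may help. It builds a throwaway corpus in a temporary directory, so the folder names, file names and sample texts are all made up for illustration. As far as I can tell NLTK does not expand `~` itself, so the root should be an absolute or already expanded path.

```python
import os
import tempfile

from nltk.corpus.reader import CategorizedPlaintextCorpusReader

# Build a tiny throwaway corpus: one subfolder per label (hypothetical data).
root = tempfile.mkdtemp()
samples = {
    "pos/review1.txt": "A wonderful film.",
    "neg/review1.txt": "A dull film.",
}
for relpath, text in samples.items():
    path = os.path.join(root, relpath)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w", encoding="utf8") as f:
        f.write(text)

# Same patterns as in the answer: every .txt file, category = leading folder name.
reader = CategorizedPlaintextCorpusReader(root, r'.*\.txt', cat_pattern=r'(\w+)/*')

categories = reader.categories()                  # labels found in the corpus
pos_files = reader.fileids(categories='pos')      # file ids labeled 'pos'
pos_words = list(reader.words(categories='pos'))  # tokens from those files

print(categories, pos_files, pos_words)
```

    If the categories come out wrong, the file ids printed by `reader.fileids()` are usually the first thing to compare against `cat_pattern`.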
    

    The other approach I was considering is to put everything in a single folder and name the files 0_neg.txt, 0_pos.txt, 1_neg.txt, and so on. The reader should then look something like:

    import os
    from nltk.corpus.reader import CategorizedPlaintextCorpusReader
    reader = CategorizedPlaintextCorpusReader(os.path.expanduser('~/MainFolder/'), r'.*\.txt', cat_pattern=r'\d+_(\w+)\.txt')
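
    It can be worth testing the category pattern on a few file names before building the reader. My understanding is that NLTK applies `cat_pattern` to each file id with `re.match` and takes the first capture group as the category, so the sketch below mimics that with the standard library only. The helper `category_of` and the file name `notes.txt` are hypothetical, added just for the check.

```python
import re

# The category pattern from the answer: digits, underscore, label, ".txt".
cat_pattern = r'\d+_(\w+)\.txt'

def category_of(file_id):
    """Return the label captured by cat_pattern, or None if the name does not fit."""
    match = re.match(cat_pattern, file_id)
    return match.group(1) if match else None

names = ['0_neg.txt', '0_pos.txt', '1_neg.txt', 'notes.txt']
labels = {name: category_of(name) for name in names}

for name, label in labels.items():
    print(name, '->', label)
```

    A name that yields None here, such as `notes.txt`, does not fit the pattern. I would expect the real reader to fail on such a file rather than skip it, so it is probably safer to keep stray .txt files out of the corpus folder.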
    

    I hope this helps someone in the future.
