I have a set of 3000 text documents and I want to extract the top 300 keywords (each could be a single word or a multi-word phrase).
I have tried the approaches below:
RAK
import os
import operator
from collections import defaultdict

files = os.listdir()
words = defaultdict(int)  # word -> number of occurrences across all files
for file in files:
    if not os.path.isfile(file):
        continue  # skip subdirectories, which cannot be opened as text
    with open(file, "r") as open_file:
        for line in open_file:
            for word in line.split():
                words[word] += 1
# Sort by count, most frequent first
sorted_words = sorted(words.items(), key=operator.itemgetter(1), reverse=True)

Now take the first 300 entries of sorted_words (sorted_words[:300]); these are the 300 most frequent words.
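
Since the goal includes multi-word keywords, the same counting idea can be extended to short phrases. The following is only a minimal sketch, not RAKE or any other established keyword-extraction method: it ranks by raw frequency, does no stop-word filtering, and the function name top_keywords, its parameters k and max_n, and the sample docs list are all made up for illustration.

```python
import re
from collections import Counter

def top_keywords(texts, k=300, max_n=2):
    """Return the k most frequent word n-grams (1..max_n words) across texts."""
    counts = Counter()
    for text in texts:
        # Lowercase, then keep runs of letters, digits, and apostrophes as tokens
        tokens = re.findall(r"[a-z0-9']+", text.lower())
        for n in range(1, max_n + 1):
            # Slide a window of n tokens over the document
            for i in range(len(tokens) - n + 1):
                counts[" ".join(tokens[i:i + n])] += 1
    # most_common returns (phrase, count) pairs, highest count first
    return counts.most_common(k)

# Hypothetical sample input; in practice, pass the contents of the 3000 files
docs = ["deep learning for text", "deep learning models", "text mining"]
top = top_keywords(docs, k=3)
```

Raw counts will be dominated by common words such as "the" and "of", so in practice a stop-word list or a weighting such as TF-IDF would probably be needed on top of this.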