I have a set of 3000 text documents and I want to extract top 300 keywords (could be single word or multiple words).
I have tried the below approaches -
RAKE (Rapid Automatic Keyword Extraction)
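For reference, one way a RAKE attempt might look, sketched with the rake_nltk package (this assumes rake_nltk is installed and the NLTK stopword data is available; the file-reading loop is just a placeholder for however the documents are loaded):

import os
from rake_nltk import Rake  # assumes the rake_nltk package is installed

r = Rake()  # uses NLTK English stopwords and punctuation as phrase delimiters by default
all_text = ""
for file in os.listdir():
    with open(file, "r") as f:
        all_text += f.read() + "\n"

r.extract_keywords_from_text(all_text)
top_keywords = r.get_ranked_phrases()[:300]  # ranked single- and multi-word phrases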
import os
import operator
from collections import defaultdict

files = os.listdir()
words = defaultdict(int)

# Count every word across all files in the current directory
for file in files:
    with open(file, "r") as open_file:
        for line in open_file:
            for word in line.split():
                words[word] += 1

# Sort from most to least frequent
sorted_words = sorted(words.items(), key=operator.itemgetter(1), reverse=True)

Now take the top 300 entries from sorted_words; those are the words you want.
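As a side note, collections.Counter can do the same counting and top-300 selection more compactly; a minimal sketch under the same assumption that every file in the current directory is a text document:

from collections import Counter
import os

counts = Counter()
for file in os.listdir():
    with open(file, "r") as f:
        counts.update(f.read().split())  # count whitespace-separated words

top_300 = [word for word, freq in counts.most_common(300)]  # most frequent first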
It is better for you to choose those 300 words manually (it's not that many, and it's a one-time task). Code written in Python 3:
import os

files = os.listdir()
topWords = ["word1", "word2.... etc"]
wordsCount = 0

for file in files:
    with open(file, "r") as file_opened:
        lines = file_opened.read().split("\n")
    for word in topWords:
        # Note: this matches lines that consist of exactly this word
        if word in lines and wordsCount < 300:
            print("I found %s" % word)
            wordsCount += 1
    # Check wordsCount again to stop once 300 words have been found
    if wordsCount >= 300:
        break
Although Latent Dirichlet Allocation (LDA) and Hierarchical Dirichlet Process (HDP) are typically used to derive topics within a text corpus and then classify individual entries by those topics, a method to derive keywords for the entire corpus can also be developed. This method has the benefit of not relying on another text corpus. A basic workflow would be to compare these Dirichlet keywords to the most common words, to see whether LDA or HDP picks up on important words that are not among the most common ones.
Before using the following code, it's generally suggested that standard text preprocessing is done: tokenization, lowercasing, punctuation and stop-word removal, and lemmatization or stemming (a rough sketch is given below). These steps would create the variable corpus used in the following. A good overview of all this with an explanation of LDA can be found here.
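A rough sketch of such preprocessing with NLTK, assuming the stopwords, punkt, and WordNet data have been downloaded and with docs standing in for your list of raw document strings; the result is corpus as a list of token lists:

import string
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    # lowercase, strip punctuation, tokenize, drop stop words, lemmatize
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = word_tokenize(text)
    return [lemmatizer.lemmatize(t) for t in tokens if t.isalpha() and t not in stop_words]

corpus = [preprocess(doc) for doc in docs]  # docs: list of raw document strings (assumed)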
Now for LDA and HDP with gensim:
from gensim.models import LdaModel, HdpModel
from gensim import corpora
First create a Dirichlet dictionary that maps the words in corpus to indexes, and then use this to create a bag of words where the tokens within corpus are replaced by their indexes. This is done via:
dirichlet_dict = corpora.Dictionary(corpus)
bow_corpus = [dirichlet_dict.doc2bow(text) for text in corpus]
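To make the mapping concrete, a toy illustration with hypothetical tokens and ids (not part of the pipeline itself):

# Each token already in the dictionary becomes an (id, count) pair;
# tokens not in the dictionary are silently dropped by doc2bow
example_bow = dirichlet_dict.doc2bow(["data", "science", "data"])
print(example_bow)  # e.g. [(12, 2), (47, 1)] if both tokens are in the dictionary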
For LDA, the optimal number of topics needs to be derived, which can be done heuristically through the method in this answer (a rough coherence-based sketch follows the next code block). Let's assume that our optimal number of topics is 10, and as per the question we want 300 keywords:
num_topics = 10
num_keywords = 300
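The linked answer isn't reproduced here, but one common heuristic, sketched with gensim's CoherenceModel and the preprocessed corpus from above, is to train models over a range of candidate topic counts and keep the count with the highest c_v coherence; the result would replace the assumed value of 10 above:

from gensim.models import CoherenceModel

coherences = {}
for k in range(5, 21, 5):  # candidate topic counts; adjust the range as needed
    model = LdaModel(corpus=bow_corpus, id2word=dirichlet_dict,
                     num_topics=k, passes=20, alpha='auto')
    cm = CoherenceModel(model=model, texts=corpus,
                        dictionary=dirichlet_dict, coherence='c_v')
    coherences[k] = cm.get_coherence()

best_num_topics = max(coherences, key=coherences.get)  # highest-coherence topic count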
Create an LDA model:
dirichlet_model = LdaModel(corpus=bow_corpus,
                           id2word=dirichlet_dict,
                           num_topics=num_topics,
                           update_every=1,
                           chunksize=len(bow_corpus),
                           passes=20,
                           alpha='auto')
Next comes a function to derive the best topics based on their average coherence across the corpus. First an ordered list of the most important words per topic is produced; then the average coherence of each topic to the whole corpus is found; and finally the topics are ordered by this average coherence and returned along with a list of the averages for later use. The code for all of this is as follows (it includes the option to use HDP from further below):
def order_subset_by_coherence(dirichlet_model, bow_corpus, num_topics=10, num_keywords=10):
    """
    Orders topics based on their average coherence across the corpus

    Parameters
    ----------
        dirichlet_model : gensim.models.type_of_model
        bow_corpus : list of lists (contains (id, freq) tuples)
        num_topics : int (default=10)
        num_keywords : int (default=10)

    Returns
    -------
        ordered_topics, ordered_topic_averages : list of lists and list
    """
    if isinstance(dirichlet_model, LdaModel):
        shown_topics = dirichlet_model.show_topics(num_topics=num_topics,
                                                   num_words=num_keywords,
                                                   formatted=False)
    elif isinstance(dirichlet_model, HdpModel):
        shown_topics = dirichlet_model.show_topics(num_topics=150,  # return all topics
                                                   num_words=num_keywords,
                                                   formatted=False)
    model_topics = [[word[0] for word in topic[1]] for topic in shown_topics]

    # Assign topic-probability pairs to each document, with the cutoff probability set to 0
    topic_corpus = dirichlet_model.__getitem__(bow=bow_corpus, eps=0)
    topics_per_response = [response for response in topic_corpus]
    flat_topic_coherences = [item for sublist in topics_per_response for item in sublist]

    significant_topics = list(set([t_c[0] for t_c in flat_topic_coherences]))  # those that appear
    topic_averages = [sum([t_c[1] for t_c in flat_topic_coherences if t_c[0] == topic_num]) / len(bow_corpus)
                      for topic_num in significant_topics]

    topic_indexes_by_avg_coherence = [tup[0] for tup in sorted(enumerate(topic_averages), key=lambda i: i[1])[::-1]]
    significant_topics_by_avg_coherence = [significant_topics[i] for i in topic_indexes_by_avg_coherence]
    ordered_topics = [model_topics[i] for i in significant_topics_by_avg_coherence][:num_topics]  # limit for HDP

    ordered_topic_averages = [topic_averages[i] for i in topic_indexes_by_avg_coherence][:num_topics]  # limit for HDP
    ordered_topic_averages = [a / sum(ordered_topic_averages) for a in ordered_topic_averages]  # normalize HDP values

    return ordered_topics, ordered_topic_averages
Now to get a list of keywords - the most important words across the topics. This is done by subsetting words (which again are ordered by significance by default) from each of the ordered topics based on their average coherence to the whole. To explain explicitly: assume that there are just two topics, and the texts are 70% coherent to the first and 30% coherent to the second. The keywords could then be the top 70% of words from the first topic, and the top 30% of words from the second that have not already been selected. This is achieved via the following:
ordered_topics, ordered_topic_averages = \
    order_subset_by_coherence(dirichlet_model=dirichlet_model,
                              bow_corpus=bow_corpus,
                              num_topics=num_topics,
                              num_keywords=num_keywords)

keywords = []
for i in range(num_topics):
    # Find the number of indexes to select, which can later be extended if the word has already been selected
    selection_indexes = list(range(int(round(num_keywords * ordered_topic_averages[i]))))
    if selection_indexes == [] and len(keywords) < num_keywords:
        # Fix potential rounding error by giving this topic one selection
        selection_indexes = [0]

    for s_i in selection_indexes:
        if ordered_topics[i][s_i] not in keywords and ordered_topics[i][s_i] not in ignore_words:
            keywords.append(ordered_topics[i][s_i])
        else:
            selection_indexes.append(selection_indexes[-1] + 1)

# Fix for if too many were selected
keywords = keywords[:num_keywords]
The above also includes the variable ignore_words, which is a list of words that should not be included in the results.
For HDP the model follows a similar process to the above, except that num_topics and other arguments do not need to be passed in model creation. HDP derives the optimal topics itself, but these topics then need to be ordered and subsetted using order_subset_by_coherence to ensure that the best topics are used for a finite selection. A model is created via:
dirichlet_model = HdpModel(corpus=bow_corpus,
                           id2word=dirichlet_dict,
                           chunksize=len(bow_corpus))
It is best to test both LDA and HDP, as LDA can outperform HDP if a suitable number of topics can be found for the problem (LDA is still the standard over HDP). Compare the Dirichlet keywords to the plain word frequencies; hopefully what's generated is a list of keywords that are more related to the overall theme of the text, not simply the words that are most common.
Obviously selecting ordered words from topics based on percent text coherence doesn’t give an overall ordering of the keywords by importance, as some words that are very important in topics with less overall coherence will be selected later.
The process for using LDA to generate keywords for the individual texts within the corpus can be found in this answer.
The easiest and most effective way is to apply a tf-idf implementation to find the most important words. If you have stop words, you can filter them out before applying this code. Hope this works for you.
import java.util.List;

/**
 * Class to calculate TfIdf of a term.
 * @author Mubin Shrestha
 */
public class TfIdf {

    /**
     * Calculates the tf of term termToCheck
     * @param totalterms : array of all the words in the document under processing
     * @param termToCheck : term of which tf is to be calculated
     * @return tf (term frequency) of term termToCheck
     */
    public double tfCalculator(String[] totalterms, String termToCheck) {
        double count = 0;  // counts the overall occurrences of termToCheck
        for (String s : totalterms) {
            if (s.equalsIgnoreCase(termToCheck)) {
                count++;
            }
        }
        return count / totalterms.length;
    }

    /**
     * Calculates idf of term termToCheck
     * @param allTerms : the terms of all the documents
     * @param termToCheck
     * @return idf (inverse document frequency) score
     */
    public double idfCalculator(List<String[]> allTerms, String termToCheck) {
        double count = 0;  // counts the number of documents containing termToCheck
        for (String[] ss : allTerms) {
            for (String s : ss) {
                if (s.equalsIgnoreCase(termToCheck)) {
                    count++;
                    break;
                }
            }
        }
        return 1 + Math.log(allTerms.size() / count);
    }
}
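Since the rest of this thread is in Python, here is a sketch of the same tf-idf idea with scikit-learn's TfidfVectorizer, assuming scikit-learn is installed and docs is your list of raw document strings; ngram_range allows multi-word keywords, and terms are ranked by their summed tf-idf score across the documents:

from sklearn.feature_extraction.text import TfidfVectorizer

# docs is assumed to be a list of raw document strings
vectorizer = TfidfVectorizer(stop_words="english", ngram_range=(1, 2))  # uni- and bigrams
tfidf = vectorizer.fit_transform(docs)

# Rank terms by their summed tf-idf score across all documents and keep the top 300
scores = tfidf.sum(axis=0).A1
terms = vectorizer.get_feature_names_out()  # get_feature_names() on older scikit-learn
top_300 = [terms[i] for i in scores.argsort()[::-1][:300]]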