For each concept in my dataset I have stored the corresponding Wikipedia categories. For example, consider the following 5 concepts and their corresponding Wikipedia categories.
There is a concept of word vectors in NLP. By looking through massive volumes of text, a model converts each word into a multi-dimensional vector, and the smaller the distance between two vectors, the greater the similarity between the corresponding words. The good news is that many people have already generated such word vectors and made them available under very permissive licences. In your case you are working with Wikipedia, and there is a Wikipedia text dump to build them from here: http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
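To make the "smaller distance means greater similarity" idea concrete, here is a tiny illustration with made-up 4-dimensional vectors (real word vectors typically have 100-300 dimensions and are learned from text):
import numpy as np

# Toy vectors, invented purely for illustration; real ones are learned from a corpus.
doctor = np.array([0.9, 0.1, 0.8, 0.3])
nurse = np.array([0.85, 0.15, 0.75, 0.35])
banana = np.array([0.1, 0.9, 0.05, 0.7])

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(doctor, nurse))   # close to 1 -> very similar
print(cosine_similarity(doctor, banana))  # much lower -> not similar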
Now, these would be the best suited for this task since they cover most of the words in Wikipedia's corpora, but in case they don't suit you, or are removed in the future, you can use one of the alternatives I list below. That said, there is a better way to do this: pass your text to TensorFlow Hub's Universal Sentence Encoder module, which does most of the heavy lifting for you; you can read more about that here. The reason I put it after the Wikipedia text dump is that I have heard people say it can be a bit hard to work with on medical samples. This paper proposes a solution to tackle that, but I have never tried it, so I cannot vouch for its accuracy.
Using the sentence embeddings from TensorFlow Hub is simple; assuming tensorflow and tensorflow_hub are installed and imported as tf and hub, just do
embed = hub.Module("https://tfhub.dev/google/universal-sentence-encoder/2")
embeddings = embed(["Input Text here as", "List of strings"])
with tf.Session() as session:
    session.run([tf.global_variables_initializer(), tf.tables_initializer()])
    print(session.run(embeddings))
Since you might not be familiar with TensorFlow, running even this small piece of code can cause some trouble. Follow this link, where the module's usage is explained completely; from there you should be able to easily modify it to your needs.
With that said, I would recommend first checking out TensorFlow's embed module and its pre-trained embeddings; if they don't work for you, check out the Wikimedia link, and if that also doesn't work, proceed to the concepts of the paper I have linked. Since this answer describes an NLP approach, it will not be 100% accurate, so keep that in mind before you proceed. Other pre-trained word vectors you can use:
GloVe vectors: https://nlp.stanford.edu/projects/glove/
Facebook's fastText: https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md
Or the One Billion Word Language Modeling Benchmark: http://www.statmt.org/lm-benchmark/1-billion-word-language-modeling-benchmark-r13output.tar.gz
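If you go with GloVe or fastText instead, a minimal sketch for loading the vectors and comparing words with gensim looks roughly like this (the file name wiki-news-300d-1M.vec is just an example of a downloaded fastText .vec file; adjust it to whatever you actually download):
from gensim.models import KeyedVectors

# Load pre-trained vectors in word2vec text format (fastText .vec files use this format).
vectors = KeyedVectors.load_word2vec_format("wiki-news-300d-1M.vec", binary=False)
print(vectors.similarity("doctor", "nurse"))     # cosine similarity; higher means more similar
print(vectors.most_similar("medicine", topn=5))  # the 5 nearest words in the vector space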
If you run into problems implementing this after following the Colab tutorial, add the problem to your question and comment below; from there we can proceed further.
Edit: Added code to cluster topics.
In brief: rather than using word vectors, I am encoding each topic's summary sentences.
File content.py
def AllTopics():
    # List all your topics here; not included for space restrictions.
    topics = []
    for topic in topics:
        yield topic
File SummaryGenerator.py
import wikipedia
import pickle
from content import AllTopics

summary = []
failed = []
for topic in AllTopics():
    try:
        # Store (topic, summary text) pairs so the labels stay attached to the text
        summary.append((topic, wikipedia.summary(topic)))
    except Exception as e:
        failed.append((topic, e))

with open("summary.txt", "wb") as fp:
    pickle.dump(summary, fp)
with open("failed.txt", "wb") as fp:
    pickle.dump(failed, fp)
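Some topics will inevitably fail (disambiguation pages, renamed articles, network errors), which is why the script pickles a failed list. As a small follow-up sketch, you can unpickle failed.txt afterwards to see which topics need manual attention:
import pickle

with open("failed.txt", "rb") as fp:
    failed = pickle.load(fp)
for topic, error in failed:
    print(topic, "->", error)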
File SimilarityCalculator.py
import tensorflow as tf
import tensorflow_hub as hub
import numpy as np
import os
import pandas as pd
import re
import pickle
import sys
from sklearn.cluster import AgglomerativeClustering
from sklearn import metrics
from scipy.cluster import hierarchy
from scipy.spatial import distance_matrix
from scipy.spatial.distance import squareform
try:
    with open("summary.txt", "rb") as fp:  # Unpickling the (topic, summary) pairs
        summary = pickle.load(fp)
except Exception as e:
    print('Cannot load the summary file. Please make sure it exists; if not, run SummaryGenerator.py first.', e)
    sys.exit('Read the error message')
module_url = "https://tfhub.dev/google/universal-sentence-encoder-large/3"
embed = hub.Module(module_url)
tf.logging.set_verbosity(tf.logging.ERROR)
messages = [x[1] for x in summary]  # the summary texts
labels = [x[0] for x in summary]    # the topic names
with tf.Session() as session:
    session.run([tf.global_variables_initializer(), tf.tables_initializer()])
    # message_embeddings is a numpy.ndarray of shape (number of summaries, 512);
    # each row is the 512-dimensional embedding of one summary text.
    message_embeddings = session.run(embed(messages))
X = message_embeddings
agl = AgglomerativeClustering(n_clusters=5, affinity='euclidean', linkage='ward')
agl.fit(X)
dist_matrix = distance_matrix(X, X)
# linkage expects a condensed distance matrix, hence the squareform conversion
Z = hierarchy.linkage(squareform(dist_matrix), 'complete')
dendro = hierarchy.dendrogram(Z)
cluster_labels = agl.labels_
This is also hosted on GitHub at https://github.com/anandvsingh/WikipediaSimilarity, where you can find the similarity.txt file and the other files. In my case I couldn't run it on all the topics, but I would urge you to run it on the full list of topics (directly clone the repository and run SummaryGenerator.py) and upload the similarity.txt via a pull request in case you don't get the expected results. If possible, also upload the message_embeddings as a CSV file of topics and their embeddings, as sketched below.
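For the CSV part, a minimal sketch (continuing from SimilarityCalculator.py, where message_embeddings and labels are already in scope) could look like this; the file name topic_embeddings.csv is just a suggestion:
import pandas as pd

# One row per topic: the topic name as the index, followed by its 512 embedding values.
df = pd.DataFrame(message_embeddings, index=labels)
df.to_csv("topic_embeddings.csv", index_label="topic")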
Changes after edit 2
Switched the similarity generator to hierarchy-based (agglomerative) clustering. I would suggest keeping the title names at the bottom of the dendrogram; for that, look at the definition of dendrogram here. I verified a few samples and the results look quite good; you can change the n_clusters value to fine-tune your model. Note: this requires you to run the summary generator again. I think you should be able to take it from here: try a few values of n_clusters, see for which one all the medical terms are grouped together, then find the cluster_label for that cluster (see the sketch below) and you are done. Since we group by summary here, the clusters will be more accurate. If you run into any problems or don't understand something, comment below.
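As a rough sketch of that last step (continuing from SimilarityCalculator.py, where Z, labels and cluster_labels are in scope), this labels the dendrogram leaves with the topic titles and prints which titles fall into each cluster:
from collections import defaultdict

# Show the topic titles at the bottom of the dendrogram instead of bare indices.
hierarchy.dendrogram(Z, labels=labels, leaf_rotation=90)

# Group the titles by their agglomerative cluster id to spot the cluster of medical terms.
clusters = defaultdict(list)
for title, cluster_id in zip(labels, cluster_labels):
    clusters[cluster_id].append(title)
for cluster_id, titles in sorted(clusters.items()):
    print(cluster_id, titles)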