For each concept in my dataset I have stored the corresponding Wikipedia categories. For example, consider the following 5 concepts and their corresponding Wikipedia categories.
The wikipedia library is also a good bet for extracting the categories of a given page, since wikipedia.WikipediaPage(page).categories returns a simple list. The library also lets you search through multiple pages when several of them share the same title.
Medicine seems to have a lot of key roots and suffixes, so searching for key words may be a good way to find medical terms.
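As a minimal offline sketch of the key-word idea on its own, before any Wikipedia lookups: the helper name matches_any_root and the short root list here are made up for illustration and are not part of the script that follows.

```python
def matches_any_root(text, roots):
    """Return the roots that occur as substrings of text (case-insensitive)."""
    lowered = text.lower()
    return [root for root in roots if root in lowered]


# Hypothetical roots, chosen only to illustrate the matching.
roots = ['cardio', 'derma', 'carcino']

for title in ['Cardiopulmonary receptor', 'Coffee table', 'Acinic cell carcinoma']:
    print(title, '->', matches_any_root(title, roots))
```

Because this is plain substring matching, very short fragments such as 'sis' will also hit unrelated words like 'analysis', so shorter roots trade precision for coverage.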
import wikipedia


def categorySorter(targetCats, pagesToCheck, mainCategory):
    targetList = []
    nonTargetList = []
    targetCats = [i.lower() for i in targetCats]

    # Walk a list of sub-page entries, recursing into nested 'SUBPAGE_ID' entries
    def deepList(l):
        for item in l:
            if item[1] == 'SUBPAGE_ID':
                deepList(item[2])
            else:
                catComparator(item[0], item[1], targetCats, targetList, nonTargetList, pagesToCheck[-1])

    print('Sorting pages...')
    print('Sorted:', end=' ', flush=True)
    for page in pagesToCheck:
        e = openPage(page)
        if e[1] == 'SUBPAGE_ID':
            deepList(e[2])
        else:
            catComparator(e[0], e[1], targetCats, targetList, nonTargetList, pagesToCheck[-1])

    print()
    print()
    print('Results:')
    print(mainCategory, ': ', targetList, sep='')
    print()
    print('Non-', mainCategory, ': ', nonTargetList, sep='')


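def openPage(page):
    try:
        pageList = [page, wikipedia.WikipediaPage(page).categories]
    except wikipedia.exceptions.PageError:
        # No page with this title: mark it so the caller can tell
        pageList = [page, 'NONEXIST_ID']
    except wikipedia.exceptions.DisambiguationError as e:
        # Ambiguous title: open every option and nest the results
        pageCategories = []
        for i in e.options:
            if '(disambiguation)' not in i:
                pageCategories.append(openPage(i))
        pageList = [page, 'SUBPAGE_ID', pageCategories]
    # A single return here (instead of one inside finally) lets any other
    # exception, such as a network error, propagate instead of being masked
    return pageList

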
def catComparator(pageTitle, pageCategories, targetCats, targetList, nonTargetList, lastPage):
    # Missing pages carry the 'NONEXIST_ID' marker instead of a category list
    if pageCategories == 'NONEXIST_ID':
        pageCategories = []
    # uncomment to view the categories of each page
    #print(pageCategories)
    pageCategories = [i.lower() for i in pageCategories]
    # A page matches if a key word appears in its title (compared in lower
    # case, like the categories) or in any of its categories
    any_in = any(i in pageTitle.lower() for i in targetCats)
    if not any_in:
        any_in = compareLists(targetCats, pageCategories)
    if any_in:
        targetList.append(pageTitle)
    else:
        nonTargetList.append(pageTitle)
    # Just prints a pretty list, you can comment out until the next comment if desired
    if any_in:
        print(pageTitle, '(T)', end='', flush=True)
    else:
        print(pageTitle, '(F)', end='', flush=True)
    if pageTitle != lastPage:
        print(',', end=' ')
    # No more commenting
    return any_in


def compareLists(a, b):
    # True if any string in a occurs as a substring of any string in b
    for i in a:
        for j in b:
            if i in j:
                return True
    return False
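The nested lists that openPage builds for ambiguous titles can be hard to picture, so here is a small offline sketch of walking one. The generator flatten_pages is a hypothetical helper, not part of the script above, and the sample entry is hand-made placeholder data in the same shape as openPage's return value, not real Wikipedia output.

```python
def flatten_pages(entry):
    """Yield (title, categories) pairs from an openPage-style entry."""
    if entry[1] == 'SUBPAGE_ID':
        # Ambiguous title: entry[2] holds one nested entry per option
        for sub in entry[2]:
            yield from flatten_pages(sub)
    elif entry[1] == 'NONEXIST_ID':
        # Missing page: report it with no categories
        yield entry[0], []
    else:
        yield entry[0], entry[1]


# Placeholder titles and categories, for illustration only.
sample = ['example', 'SUBPAGE_ID', [
    ['Example (medicine)', ['Medical terminology']],
    ['Example (missing)', 'NONEXIST_ID'],
    ['Example (other)', 'SUBPAGE_ID', [
        ['Example (nested)', ['Placeholder category']],
    ]],
]]

for title, categories in flatten_pages(sample):
    print(title, categories)
```

This is the same recursion that deepList performs, just separated from the printing and sorting so the shape of the data is easier to see.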
The code really just compares a list of key words and suffixes against the title of each page, as well as its categories, to determine whether the page is medically related. It also looks at the related pages/sub-pages of the bigger topics and determines whether those are related as well. I am not well versed in medicine, so forgive the categories, but here is an example to tack onto the bottom:
medicalCategories = ['surgery', 'medic', 'disease', 'drugs', 'virus', 'bact', 'fung', 'pharma', 'cardio', 'pulmo', 'sensory', 'nerv', 'derma', 'protein', 'amino', 'unii', 'chlor', 'carcino', 'oxi', 'oxy', 'sis', 'disorder', 'enzyme', 'eine', 'sulf']
listOfPages = ['juvenile chronic arthritis', 'climate', 'alexidine', 'mouthrinse', 'sialosis', 'australia', 'artificial neural network', 'ricinoleic acid', 'bromosulfophthalein', 'myelosclerosis', 'hydrochloride salt', 'cycasin', 'aldosterone antagonist', 'fungal growth', 'describe', 'liver resection', 'coffee table', 'natural language processing', 'infratemporal fossa', 'social withdrawal', 'information retrieval', 'monday', 'menthol', 'overturn', 'prevailing', 'spline function', 'acinic cell carcinoma', 'furth', 'hepatic protein', 'blistering', 'prefixation', 'january', 'cardiopulmonary receptor', 'extracorporeal membrane oxygenation', 'clinodactyly', 'melancholic', 'chlorpromazine hydrochloride', 'level of evidence', 'washington state', 'cat', 'year elevan', 'trituration', 'gold alloy', 'hexoprenaline', 'second molar', 'novice', 'oxygen radical', 'subscription', 'ordinate', 'approximal', 'spongiosis', 'ribothymidine', 'body of evidence', 'vpb', 'porins', 'musculocutaneous']
categorySorter(medicalCategories, listOfPages, 'Medical')
With this example list, the script catches ~70% of the pages that should be on the medical list, at least to my knowledge.
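If you want to put a number on that yourself, one rough way is to hand-label which pages you consider medical and compare that with what the script marked (T). The sketch below assumes you have both as plain lists; the function name recall and the placeholder titles are hypothetical and do not come from an actual run.

```python
def recall(found, expected):
    """Fraction of the expected titles that also appear in found."""
    expected = set(expected)
    if not expected:
        return 0.0
    return len(expected & set(found)) / len(expected)


# Placeholder titles, for illustration only (not real results).
expected_medical = ['term_a', 'term_b', 'term_c', 'term_d']
found_medical = ['term_a', 'term_c', 'term_x']

print(recall(found_medical, expected_medical))  # prints 0.5
```

Titles in found that are not in expected (term_x here) do not lower this figure, so it says nothing about false positives; those would need a separate check.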