How to group wikipedia categories in python?

后端未结

关注

 6  668

心在旅途 2021-02-01 06:58

For each concept of my dataset I have stored the corresponding wikipedia categories. For example, consider the following 5 concepts and their corresponding wikipedia categories.

6条回答

自闭症患者 (楼主)

2021-02-01 07:24

The wikipedia library is also a good bet to extract the categories from a given page, as wikipedia.WikipediaPage(page).categories returns a simple list. The library also lets you search multiple pages should they all have the same title.

In medicine there seems to be a lot of key roots and suffixes, so the approach of finding key words may be a good approach to finding medical terms.

import wikipedia

def categorySorter(targetCats, pagesToCheck, mainCategory):
    targetList = []
    nonTargetList = []
    targetCats = [i.lower() for i in targetCats]

    print('Sorting pages...')
    print('Sorted:', end=' ', flush=True)
    for page in pagesToCheck:

        e = openPage(page)

        def deepList(l):
            for item in l:
                if item[1] == 'SUBPAGE_ID':
                    deepList(item[2])
                else:
                    catComparator(item[0], item[1], targetCats, targetList, nonTargetList, pagesToCheck[-1])

        if e[1] == 'SUBPAGE_ID':
            deepList(e[2])
        else:
            catComparator(e[0], e[1], targetCats, targetList, nonTargetList, pagesToCheck[-1])

    print()
    print()
    print('Results:')
    print(mainCategory, ': ', targetList, sep='')
    print()
    print('Non-', mainCategory, ': ', nonTargetList, sep='')

def openPage(page):
    try:
        pageList = [page, wikipedia.WikipediaPage(page).categories]
    except wikipedia.exceptions.PageError as p:
        pageList = [page, 'NONEXIST_ID']
        return
    except wikipedia.exceptions.DisambiguationError as e:
        pageCategories = []
        for i in e.options:
            if '(disambiguation)' not in i:
                pageCategories.append(openPage(i))
        pageList = [page, 'SUBPAGE_ID', pageCategories]
        return pageList
    finally:
        return pageList

def catComparator(pageTitle, pageCategories, targetCats, targetList, nonTargetList, lastPage):

    # unhash to view the categories of each page
    #print(pageCategories)
    pageCategories = [i.lower() for i in pageCategories]

    any_in = False
    for i in targetCats:
        if i in pageTitle:
            any_in = True
    if any_in:
        print('', end = '', flush=True)
    elif compareLists(targetCats, pageCategories):
        any_in = True

    if any_in:
        targetList.append(pageTitle)
    else:
        nonTargetList.append(pageTitle)

    # Just prints a pretty list, you can comment out until next hash if desired
    if any_in:
        print(pageTitle, '(T)', end='', flush=True)
    else:
        print(pageTitle, '(F)',end='', flush=True)

    if pageTitle != lastPage:
        print(',', end=' ')
    # No more commenting

    return any_in

def compareLists (a, b):
    for i in a:
        for j in b:
            if i in j:
                return True
    return False

The code is really just comparing a lists of key words and suffixes to the titles of each page as well as their categories to determine if a page is medically related. It also looks at related pages/sub pages for the bigger topics, and determines if those are related as well. I am not well versed in my medicine so forgive the categories but here is an example to tag onto the bottom:

medicalCategories = ['surgery', 'medic', 'disease', 'drugs', 'virus', 'bact', 'fung', 'pharma', 'cardio', 'pulmo', 'sensory', 'nerv', 'derma', 'protein', 'amino', 'unii', 'chlor', 'carcino', 'oxi', 'oxy', 'sis', 'disorder', 'enzyme', 'eine', 'sulf']
listOfPages = ['juvenile chronic arthritis', 'climate', 'alexidine', 'mouthrinse', 'sialosis', 'australia', 'artificial neural network', 'ricinoleic acid', 'bromosulfophthalein', 'myelosclerosis', 'hydrochloride salt', 'cycasin', 'aldosterone antagonist', 'fungal growth', 'describe', 'liver resection', 'coffee table', 'natural language processing', 'infratemporal fossa', 'social withdrawal', 'information retrieval', 'monday', 'menthol', 'overturn', 'prevailing', 'spline function', 'acinic cell carcinoma', 'furth', 'hepatic protein', 'blistering', 'prefixation', 'january', 'cardiopulmonary receptor', 'extracorporeal membrane oxygenation', 'clinodactyly', 'melancholic', 'chlorpromazine hydrochloride', 'level of evidence', 'washington state', 'cat', 'year elevan', 'trituration', 'gold alloy', 'hexoprenaline', 'second molar', 'novice', 'oxygen radical', 'subscription', 'ordinate', 'approximal', 'spongiosis', 'ribothymidine', 'body of evidence', 'vpb', 'porins', 'musculocutaneous']
categorySorter(medicalCategories, listOfPages, 'Medical')

This example list gets ~70% of what should be on the list, at least to my knowledge.

0 讨论(0)

查看其它6个回答