How to group Wikipedia categories in Python?

心在旅途 2021-02-01 06:58

For each concept in my dataset, I have stored the corresponding Wikipedia categories. For example, consider the following five concepts and their corresponding Wikipedia categories.

6 answers
  •  自闭症患者
    2021-02-01 07:24

    The wikipedia library is also a good way to extract the categories of a given page: wikipedia.WikipediaPage(page).categories returns a plain list of category names. When a title is ambiguous, the library raises a DisambiguationError whose options attribute lists the candidate pages, so you can check each of them.

    Medical terms tend to share a lot of common roots and suffixes, so matching on keyword stems may be a good way to find them.
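
    To illustrate the idea of roots and suffixes, here is a minimal sketch that is stricter than a plain substring test: a root has to start a word and a suffix has to end one. The stems and the function name medical_stems are my own assumptions for illustration, not a vetted medical vocabulary and not part of the code further down.

```python
import re

# Hypothetical, illustrative stems; not a vetted medical vocabulary.
ROOTS = ("cardio", "pulmo", "derma", "carcino", "pharma")
SUFFIXES = ("osis", "itis", "ectomy")


def medical_stems(text, roots=ROOTS, suffixes=SUFFIXES):
    """Return the sorted stems that match some word of `text`.

    A root must start a word and a suffix must end one, which is
    stricter than checking whether the stem occurs anywhere in the text.
    """
    words = re.findall(r"[a-z]+", text.lower())
    hits = set()
    for word in words:
        hits.update(r for r in roots if word.startswith(r))
        hits.update(s for s in suffixes if word.endswith(s))
    return sorted(hits)


print(medical_stems("Cardiopulmonary receptor"))
print(medical_stems("coffee table"))
```

    Anchoring stems to word boundaries trades recall for precision, so whether it helps depends on how the keyword list is built.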

    import wikipedia
    
    def categorySorter(targetCats, pagesToCheck, mainCategory):
        targetList = []
        nonTargetList = []
        targetCats = [i.lower() for i in targetCats]
    
        print('Sorting pages...')
        print('Sorted:', end=' ', flush=True)
        for page in pagesToCheck:
    
            e = openPage(page)
    
            def deepList(l):
                for item in l:
                    if item[1] == 'SUBPAGE_ID':
                        deepList(item[2])
                    else:
                        catComparator(item[0], item[1], targetCats, targetList, nonTargetList, pagesToCheck[-1])
    
            if e[1] == 'SUBPAGE_ID':
                deepList(e[2])
            else:
                catComparator(e[0], e[1], targetCats, targetList, nonTargetList, pagesToCheck[-1])
    
        print()
        print()
        print('Results:')
        print(mainCategory, ': ', targetList, sep='')
        print()
        print('Non-', mainCategory, ': ', nonTargetList, sep='')
    
    def openPage(page):
        # Returns one of:
        #   [title, categories]                 for a normal page
        #   [title, 'NONEXIST_ID']              if the page does not exist
        #   [title, 'SUBPAGE_ID', [results]]    for a disambiguation page
        try:
            return [page, wikipedia.WikipediaPage(page).categories]
        except wikipedia.exceptions.PageError:
            return [page, 'NONEXIST_ID']
        except wikipedia.exceptions.DisambiguationError as e:
            # Open every candidate page except nested disambiguation pages
            pageCategories = []
            for i in e.options:
                if '(disambiguation)' not in i:
                    pageCategories.append(openPage(i))
            return [page, 'SUBPAGE_ID', pageCategories]
    
    def catComparator(pageTitle, pageCategories, targetCats, targetList, nonTargetList, lastPage):
    
        # Uncomment to view the categories of each page
        #print(pageCategories)
        pageCategories = [i.lower() for i in pageCategories]
    
        # A page matches if any keyword occurs in its title or in one of its categories
        title = pageTitle.lower()
        any_in = any(i in title for i in targetCats) or compareLists(targetCats, pageCategories)
    
        if any_in:
            targetList.append(pageTitle)
        else:
            nonTargetList.append(pageTitle)
    
        # Just prints a pretty list, you can comment out until next hash if desired
        if any_in:
            print(pageTitle, '(T)', end='', flush=True)
        else:
            print(pageTitle, '(F)',end='', flush=True)
    
        if pageTitle != lastPage:
            print(',', end=' ')
        # No more commenting
    
        return any_in
    
    def compareLists(a, b):
        for i in a:
            for j in b:
                if i in j:
                    return True
        return False
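
    For reference, openPage above returns a nested list for disambiguation pages, which is why categorySorter needs the recursive deepList helper. A rough sketch of the same traversal as a generator follows. The name iter_leaf_pages and the sample data are assumptions made up for illustration; nothing here is fetched from Wikipedia.

```python
def iter_leaf_pages(entry):
    """Yield (title, categories) for every concrete page under `entry`.

    `entry` is assumed to have one of the shapes produced by openPage:
    [title, categories], [title, 'NONEXIST_ID'] or
    [title, 'SUBPAGE_ID', [child entries]].
    """
    title, payload = entry[0], entry[1]
    if payload == 'SUBPAGE_ID':
        for child in entry[2]:
            yield from iter_leaf_pages(child)
    elif payload == 'NONEXIST_ID':
        # Missing pages are reported with no categories
        yield title, []
    else:
        yield title, payload


# Hand-written sample; the titles and categories are made up.
sample = ['mercury', 'SUBPAGE_ID', [
    ['Mercury (element)', ['Chemical elements', 'Neurotoxins']],
    ['Mercury (planet)', ['Planets of the Solar System']],
    ['Mercury (nowhere)', 'NONEXIST_ID'],
]]

leaves = list(iter_leaf_pages(sample))
for leaf_title, leaf_categories in leaves:
    print(leaf_title, leaf_categories)
```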
    

    The code simply compares a list of keywords and suffixes against the title and the categories of each page to decide whether the page is medically related. For ambiguous titles it also opens the candidate sub-pages and classifies each of those. I am not well versed in medicine, so forgive the choice of keywords, but here is an example to add at the bottom:

    medicalCategories = ['surgery', 'medic', 'disease', 'drugs', 'virus', 'bact', 'fung', 'pharma', 'cardio', 'pulmo', 'sensory', 'nerv', 'derma', 'protein', 'amino', 'unii', 'chlor', 'carcino', 'oxi', 'oxy', 'sis', 'disorder', 'enzyme', 'eine', 'sulf']
    listOfPages = ['juvenile chronic arthritis', 'climate', 'alexidine', 'mouthrinse', 'sialosis', 'australia', 'artificial neural network', 'ricinoleic acid', 'bromosulfophthalein', 'myelosclerosis', 'hydrochloride salt', 'cycasin', 'aldosterone antagonist', 'fungal growth', 'describe', 'liver resection', 'coffee table', 'natural language processing', 'infratemporal fossa', 'social withdrawal', 'information retrieval', 'monday', 'menthol', 'overturn', 'prevailing', 'spline function', 'acinic cell carcinoma', 'furth', 'hepatic protein', 'blistering', 'prefixation', 'january', 'cardiopulmonary receptor', 'extracorporeal membrane oxygenation', 'clinodactyly', 'melancholic', 'chlorpromazine hydrochloride', 'level of evidence', 'washington state', 'cat', 'year elevan', 'trituration', 'gold alloy', 'hexoprenaline', 'second molar', 'novice', 'oxygen radical', 'subscription', 'ordinate', 'approximal', 'spongiosis', 'ribothymidine', 'body of evidence', 'vpb', 'porins', 'musculocutaneous']
    categorySorter(medicalCategories, listOfPages, 'Medical')
    

    By my estimate, this example keyword list catches roughly 70% of the pages that should be classified as medical.
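
    To see why a page was or was not matched, it can help to report which keywords fired instead of only True or False. The sketch below uses the same case-insensitive substring test as the code above; the function name matching_keywords and the sample categories are made up for illustration.

```python
def matching_keywords(keywords, title, categories):
    """Return the keywords found in the title or in any category (case-insensitive)."""
    haystacks = [title.lower()] + [c.lower() for c in categories]
    return [k for k in keywords if any(k.lower() in h for h in haystacks)]


keywords = ['medic', 'sis', 'oxi', 'chlor']

# Made-up categories: the short stem 'sis' also fires on 'analysis'.
print(matching_keywords(keywords, 'natural language processing',
                        ['Computational linguistics', 'Speech analysis']))
print(matching_keywords(keywords, 'chlorpromazine hydrochloride', []))
```

    Very short stems such as 'sis' or 'oxi' match many unrelated words, so listing the keywords that fired is a quick way to decide which ones to tighten.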
