How to group wikipedia categories in python?

心在旅途 2021-02-01 06:58

For each concept in my dataset I have stored the corresponding Wikipedia categories. For example, consider the following 5 concepts and their corresponding Wikipedia categories.

6 Answers

  •  你的背包 2021-02-01 07:16

    There is a concept of word vectors in NLP: by looking through massive volumes of text, a model converts words to multi-dimensional vectors, and the smaller the distance between those vectors, the greater the similarity between the words. The good thing is that many people have already generated these word vectors and made them available under very permissive licences. In your case you are working with Wikipedia, and its text dump, from which such vectors can be built, is available here: http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

    Vectors built from that dump would be the best suited for this task, since they cover most of the words in Wikipedia's corpora, but in case they are not suited for you, or are removed in the future, you can use one of the alternatives I list below. That said, there is a better way to do this: passing your text to TensorFlow's universal sentence encoder module, in which case you don't have to do most of the heavy lifting; you can read more about that here. The reason I put it after the Wikipedia text dump is that I have heard people say it is a bit hard to work with on medical samples. This paper does propose a solution to tackle that, but I have never tried it, so I cannot be sure of its accuracy.

    Now, using the sentence embeddings from TensorFlow is simple; just do

    import tensorflow as tf
    import tensorflow_hub as hub

    embed = hub.Module("https://tfhub.dev/google/universal-sentence-encoder/2")
    embeddings = embed(["Input Text here as", "List of strings"])
    with tf.Session() as session:
        session.run([tf.global_variables_initializer(), tf.tables_initializer()])
        vectors = session.run(embeddings)  # one 512-dimensional vector per input string
    

    Since you might not be familiar with TensorFlow, running just this piece of code could give you some trouble. Follow this link, where they describe fully how to use it, and from there you should be able to adapt it easily to your needs.

    With that said, I would recommend first checking out TensorFlow's embed module and their pre-trained word embeddings; if they don't work for you, check out the Wikimedia link, and if that also doesn't work, then proceed to the concepts of the paper I have linked. Since this answer describes an NLP approach, it will not be 100% accurate, so keep that in mind before you proceed.

    GloVe vectors: https://nlp.stanford.edu/projects/glove/

    Facebook's fastText: https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md

    Or this http://www.statmt.org/lm-benchmark/1-billion-word-language-modeling-benchmark-r13output.tar.gz
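
    If you go with one of these pre-trained vector files instead of the TensorFlow module, a minimal sketch of comparing two category names with GloVe vectors might look like this (the file name glove.6B.100d.txt comes from the GloVe download above; the two category strings are just placeholders):

    import numpy as np

    def load_glove(path="glove.6B.100d.txt"):
        # each line of the GloVe file is a word followed by its vector components
        vectors = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                parts = line.rstrip().split(" ")
                vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
        return vectors

    def phrase_vector(phrase, vectors):
        # average the word vectors of a short phrase such as a category name
        words = [vectors[w] for w in phrase.lower().split() if w in vectors]
        return np.mean(words, axis=0) if words else None

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    glove = load_glove()
    v1 = phrase_vector("drug abuse", glove)            # placeholder category name
    v2 = phrase_vector("substance dependence", glove)  # placeholder category name
    if v1 is not None and v2 is not None:
        print(cosine(v1, v2))  # closer to 1.0 means more similar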

    If you run into problems implementing this after following the Colab tutorial, add your problem to the question and comment below; from there we can proceed further.

    Edit: added code to cluster the topics.

    In brief: rather than using word vectors, I am encoding each topic's summary sentences.

    File content.py

    def AllTopics():
        topics = []  # list all your topics here; not included for space restrictions
        for topic in topics:
            yield topic
    

    File summaryGenerator.py

    import wikipedia
    import pickle
    from content import AllTopics

    summary = []
    failed = []
    for topic in AllTopics():
        try:
            # store (topic, summary text) pairs so the labels stay attached to the summaries
            summary.append((topic, wikipedia.summary(topic)))
        except Exception as e:
            failed.append((topic, e))
    with open("summary.txt", "wb") as fp:
        pickle.dump(summary, fp)
    with open("failed.txt", "wb") as fp:
        pickle.dump(failed, fp)
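
    If you want to sanity-check what was scraped before clustering, a small sketch (using the same file names as above) to reload and inspect the pickles could be:

    import pickle

    with open("summary.txt", "rb") as fp:
        summary = pickle.load(fp)
    with open("failed.txt", "rb") as fp:
        failed = pickle.load(fp)

    print(len(summary), "summaries fetched,", len(failed), "topics failed")
    if summary:
        topic, text = summary[0]
        print(topic, "->", text[:100])  # first 100 characters of the first summary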
    

    File SimilarityCalculator.py

    import tensorflow as tf
    import tensorflow_hub as hub
    import numpy as np
    import os
    import pandas as pd
    import re
    import pickle
    import sys
    from sklearn.cluster import AgglomerativeClustering
    from sklearn import metrics
    from scipy.cluster import hierarchy
    from scipy.spatial import distance_matrix
    from scipy.spatial.distance import squareform
    
    
    try:
        with open("summary.txt", "rb") as fp:   # Unpickling
            summary = pickle.load(fp)
    except Exception as e:
        print('Cannot load the summary file. Please make sure it exists; if not, run summaryGenerator.py first.', e)
        sys.exit('Read the error message')
    
    module_url = "https://tfhub.dev/google/universal-sentence-encoder-large/3"
    embed = hub.Module(module_url)
    
    tf.logging.set_verbosity(tf.logging.ERROR)
    messages = [x[1] for x in summary]
    labels = [x[0] for x in summary]
    with tf.Session() as session:
        session.run([tf.global_variables_initializer(), tf.tables_initializer()])
        message_embeddings = session.run(embed(messages))  # each row is one 512-dimensional sentence embedding; shape is (numSentences, 512)
    
    X = message_embeddings
    agl = AgglomerativeClustering(n_clusters=5, affinity='euclidean', linkage='ward')
    agl.fit(X)
    dist_matrix = distance_matrix(X, X)
    Z = hierarchy.linkage(squareform(dist_matrix), 'complete')  # linkage expects a condensed distance matrix
    dendro = hierarchy.dendrogram(Z)
    cluster_labels = agl.labels_
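
    To see which topic ended up in which cluster (and to help pick the n_clusters value discussed below), a short follow-up sketch, reusing the labels and cluster_labels variables from above, could be:

    from collections import defaultdict

    clusters = defaultdict(list)
    for topic, label in zip(labels, cluster_labels):
        clusters[label].append(topic)

    for label, topics in sorted(clusters.items()):
        print(label, topics)  # eyeball which cluster holds the medical terms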
    

    This is also hosted on GitHub at https://github.com/anandvsingh/WikipediaSimilarity, where you can find the similarity.txt file and other files. In my case I couldn't run it on all the topics, but I would urge you to run it on the full list of topics (directly clone the repository and run SummaryGenerator.py) and to upload the similarity.txt via a pull request in case you don't get the expected result. If possible, also upload the message_embeddings in a CSV file as topics and their embeddings.

    Changes after edit 2: Switched the similarity generator to hierarchy-based (agglomerative) clustering. I would suggest keeping the title names at the bottom of the dendrogram; for that, look at the definition of dendrogram here. I verified it by viewing some samples and the results look quite good; you can change the n_clusters value to fine-tune your model. Note: this requires you to run the summary generator again. I think you should be able to take it from here: try a few values of n_clusters, see for which one all the medical terms are grouped together, then find the cluster_label for that cluster, and you are done. Since here we group by summary, the clusters will be more accurate. If you run into any problems or don't understand something, comment below.
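
    If you prefer a numeric guide for choosing n_clusters over eyeballing the dendrogram, one common heuristic (not part of the original script, just an option) is the silhouette score; a minimal sketch, assuming X is the (numSentences, 512) embedding matrix from SimilarityCalculator.py:

    from sklearn.cluster import AgglomerativeClustering
    from sklearn.metrics import silhouette_score

    # X is the embedding matrix built in SimilarityCalculator.py (assumed available here)
    for k in range(2, 10):
        model = AgglomerativeClustering(n_clusters=k, affinity='euclidean', linkage='ward')
        preds = model.fit_predict(X)
        print(k, silhouette_score(X, preds))  # higher usually means better-separated clusters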
