Tag generation from text content

北荒 2020-11-29 15:56

I am curious whether there is an algorithm or method to generate keywords/tags from a given text, using weight calculations, occurrence ratios, or other tools.

5 Answers
  •  Happy的楠姐
    2020-11-29 16:04

    Latent Dirichlet allocation or Hierarchical Dirichlet Process can be used to generate tags for individual texts within a greater corpus (body of texts) by extracting the most important words from the derived topics.

    A basic example: say we run LDA over a corpus, define it to have two topics, and find that a given text in the corpus is 70% one topic and 30% the other. 70% of the tags could then be drawn from the top words defining the first topic and 30% from those defining the second (without duplication). This method provides strong results, as the tags generally represent the broader themes of the given texts.
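
    This 70/30 allocation can be sketched in plain Python. All topic words and coherence numbers below are invented purely for illustration:

```python
# Hypothetical illustration: a text that is 70% topic A and 30% topic B,
# with five tags requested. All topic words here are made up.
num_tags = 5
topic_words = {
    "A": ["economy", "market", "trade", "growth", "inflation"],
    "B": ["election", "senate", "vote", "policy", "campaign"],
}
coherences = {"A": 0.7, "B": 0.3}

tags = []
# Allot tags to each topic proportionally to its share of the text
for topic in sorted(coherences, key=coherences.get, reverse=True):
    n = int(round(num_tags * coherences[topic]))
    tags += [w for w in topic_words[topic][:n] if w not in tags]
tags = tags[:num_tags]  # trim any rounding overshoot
print(tags)  # → ['economy', 'market', 'trade', 'growth', 'election']
```

    Note that Python's round() rounds ties to the nearest even integer, which is why both topics round up here and the result needs trimming.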

    A general reference for the preprocessing these snippets require can be found here; with that in place, we can derive tags through the following process using gensim.
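
    As a minimal, purely illustrative preprocessing sketch (a full pipeline per the reference above would also handle stopword lists properly, lemmatization, n-grams, etc.), the toy texts and stopword set below are assumptions:

```python
import re

# Toy stopword set for illustration only; use a proper list in practice
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for"}

def preprocess(text):
    # Lowercase, keep alphabetic tokens, drop stopwords and very short words
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS and len(t) > 2]

raw_texts = [
    "The economy is growing, and the market reacts to trade policy.",
    "The senate vote is central to the election campaign.",
]
corpus = [preprocess(t) for t in raw_texts]
print(corpus[0])  # → ['economy', 'growing', 'market', 'reacts', 'trade', 'policy']
```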

    A heuristic way of deriving the optimal number of topics for LDA is found in this answer. Although HDP does not require the number of topics as an input, the standard in such cases is still to use LDA with a derived topic number, as HDP can be problematic. Assume here that the corpus is found to have 10 topics, and we want 5 tags per text:

    from gensim.models import LdaModel, HdpModel
    from gensim import corpora
    
    num_topics = 10
    num_tags = 5
    

    Assume further that we have a variable corpus, which is a preprocessed list of lists, whose sublist entries are word tokens. Initialize a Dirichlet dictionary and create a bag of words, where each text is converted to the indexes of its component tokens (words):

    dirichlet_dict = corpora.Dictionary(corpus)
    bow_corpus = [dirichlet_dict.doc2bow(text) for text in corpus]
    

    Create an LDA or HDP model:

    dirichlet_model = LdaModel(corpus=bow_corpus,
                               id2word=dirichlet_dict,
                               num_topics=num_topics,
                               update_every=1,
                               chunksize=len(bow_corpus),
                               passes=20,
                               alpha='auto')
    
    # dirichlet_model = HdpModel(corpus=bow_corpus, 
    #                            id2word=dirichlet_dict,
    #                            chunksize=len(bow_corpus))
    

    The following code produces ordered lists of the most important words per topic (note that this is where num_tags defines the desired number of tags per text):

    shown_topics = dirichlet_model.show_topics(num_topics=num_topics, 
                                               num_words=num_tags,
                                               formatted=False)
    model_topics = [[word[0] for word in topic[1]] for topic in shown_topics]
    
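
    For reference, show_topics(formatted=False) returns (topic_id, [(word, probability), ...]) pairs, so the comprehension above simply strips the probabilities. On a mock return value (words and probabilities here are invented):

```python
# Mock of gensim's show_topics(formatted=False) return shape; the actual
# words and probabilities depend on the fitted model.
shown_topics = [
    (0, [("economy", 0.12), ("market", 0.10), ("trade", 0.08)]),
    (1, [("election", 0.15), ("vote", 0.11), ("senate", 0.07)]),
]
model_topics = [[word[0] for word in topic[1]] for topic in shown_topics]
print(model_topics)  # → [['economy', 'market', 'trade'], ['election', 'vote', 'senate']]
```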

    Then find the coherence of the topics across the texts:

    topic_corpus = dirichlet_model.__getitem__(bow=bow_corpus, eps=0) # eps=0: no probability cutoff, keep all topics
    topics_per_text = list(topic_corpus)
    
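
    Each entry of topics_per_text is then a list of (topic_id, probability) pairs for one text, and with eps=0 every topic appears. A toy two-topic example of this shape (numbers invented):

```python
# Hypothetical per-text topic distributions for two texts and two topics
# (with eps=0, even low-probability topics are listed)
topics_per_text = [
    [(0, 0.7), (1, 0.3)],
    [(0, 0.2), (1, 0.8)],
]
# Most coherent topic for the first text:
print(max(topics_per_text[0], key=lambda t: t[1]))  # → (0, 0.7)
```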

    From here we have the percentage that each text coheres to a given topic, and the words associated with each topic, so we can combine them for tags with the following:

    corpus_tags = []
    ignore_words = []  # words that should not be used as tags (e.g. stopwords or domain noise)
    
    for i in range(len(bow_corpus)):
        # Order this text's topics by coherence (probability), highest first; working
        # from the (topic_id, probability) pairs directly keeps this compatible with
        # HDP, which may not assign every topic to every text
        sorted_topics = [t for t in sorted(topics_per_text[i], key=lambda tup: tup[1], reverse=True)
                         if t[0] < len(model_topics)][:num_topics] # subset for HDP
    
        ordered_topics = [model_topics[t[0]] for t in sorted_topics]
        ordered_topic_coherences = [t[1] for t in sorted_topics]
    
        text_tags = []
        for j in range(len(ordered_topics)):
            # The number of indexes to select, which can be extended below if a word has already been selected
            selection_indexes = list(range(int(round(num_tags * ordered_topic_coherences[j]))))
            if selection_indexes == [] and len(text_tags) < num_tags:
                # Fix a potential rounding error by giving this topic one selection
                selection_indexes = [0]
    
            for s_i in selection_indexes:
                if s_i >= len(ordered_topics[j]): # no candidate words left for this topic
                    break
                if ordered_topics[j][s_i] not in text_tags and ordered_topics[j][s_i] not in ignore_words:
                    text_tags.append(ordered_topics[j][s_i])
                else:
                    # Extend the selection to the next candidate word from this topic
                    selection_indexes.append(selection_indexes[-1] + 1)
    
        # Fix for if too many were selected
        text_tags = text_tags[:num_tags]
    
        corpus_tags.append(text_tags)
    

    corpus_tags will be a list of tags for each text based on how coherent the text is to the derived topics.

    See this answer for a similar version of this that generates tags for a whole text corpus.
