Extracting Key-Phrases from text based on the Topic with Python

被撕碎了的回忆 2021-01-03 03:36

I have a large dataset with 3 columns: text, phrase and topic. I want to find a way to extract key-phrases (phrases column) based on the topic. Key-Phrase can b

3 answers
  •  暖寄归人
    2021-01-03 04:39

    It appears you're looking to group short pieces of text by topic. To do that, you will have to tokenize the text and encode it as features in one way or another. There are several encodings you could consider:

    Bag of words: represents each text by counting how often each word in your vocabulary appears in it.

    TF-IDF: uses the same counts as bag of words, but gives less weight to words that appear in many entries.

    N-grams (bigrams, trigrams): essentially the bag-of-words method, but it also keeps some context around each word. You get a token for each word, plus tokens such as "great_game", "game_with" and "great_game_with".

    Orthogonal Sparse Bigrams (OSBs): also create features from words that are further apart, such as "great__with".
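
    The encodings listed above can be sketched in plain Python. This is a minimal illustration, not a tuned implementation: the helper names (tokenize, ngrams, osb, features, tf_idf), the sample sentences, and the choice of log(N / df) as the IDF formula are all assumptions made for the example. Libraries such as scikit-learn provide production versions of the counting and TF-IDF steps.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def ngrams(tokens, n):
    """Contiguous n-grams joined with '_', e.g. 'great_game'."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def osb(tokens, max_skip=2):
    """Sparse bigrams: pair each word with a later word, skipping 1..max_skip
    words in between. One skipped word yields 'great__with'."""
    feats = []
    for i, left in enumerate(tokens):
        for skip in range(1, max_skip + 1):
            j = i + skip + 1
            if j < len(tokens):
                feats.append(left + "_" * (skip + 1) + tokens[j])
    return feats


def features(text):
    """Unigrams, bigrams, trigrams and sparse bigrams for one text."""
    tokens = tokenize(text)
    return tokens + ngrams(tokens, 2) + ngrams(tokens, 3) + osb(tokens)


def tf_idf(bags):
    """Weight each bag-of-words count by how rare the feature is overall."""
    n_docs = len(bags)
    doc_freq = Counter()
    for bag in bags:
        doc_freq.update(bag.keys())  # count documents, not occurrences
    weighted = []
    for bag in bags:
        total = sum(bag.values())
        weighted.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in bag.items()
        })
    return weighted


# Hypothetical sample texts, invented for this example.
docs = [
    "Great game with friends",
    "Great food with friends",
    "The game went to overtime",
]
bags = [Counter(features(d)) for d in docs]  # bag of words per text
weights = tf_idf(bags)

# Show the five highest-weighted features of the first text.
for term, weight in sorted(weights[0].items(), key=lambda kv: -kv[1])[:5]:
    print(f"{term}: {weight:.3f}")
```

    Features that occur in only one text, such as "great_game", end up with a higher weight than words shared across texts, such as "great", which is the down-weighting effect TF-IDF is meant to provide.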

    Any of these options could suit your dataset (the last two are likely your best bet). If none of them works, there are a few more options you could try:


    First, you could use word embeddings. These are vector representations of words that, unlike one-hot encodings, intrinsically capture word meaning. You can sum the vectors of the words in a sentence to get a new vector that contains the general idea of what the sentence is about, which can then be decoded.
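
    As a rough sketch of the embedding idea, the example below combines word vectors into a sentence vector and compares it with topic vectors using cosine similarity. Everything in it is an assumption for illustration: the 3-dimensional vectors are made up (real embeddings would come from a pretrained model), the topic names are hypothetical, and the vectors are averaged rather than summed, which only rescales the result and does not change the cosine similarity.

```python
import math

# Toy vectors invented for this example; they are not real embeddings.
EMBEDDINGS = {
    "game": [0.9, 0.1, 0.0],
    "match": [0.8, 0.2, 0.1],
    "score": [0.7, 0.0, 0.2],
    "pasta": [0.0, 0.9, 0.1],
    "sauce": [0.1, 0.8, 0.0],
    "great": [0.3, 0.3, 0.3],
}


def sentence_vector(tokens, embeddings):
    """Average the vectors of all known tokens; None if none are known."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def closest_topic(tokens, topics, embeddings):
    """Return the name of the topic whose vector is most similar."""
    vec = sentence_vector(tokens, embeddings)
    if vec is None:
        return None
    return max(topics, key=lambda name: cosine(vec, topics[name]))


# Hypothetical topics, each represented by the average of a few seed words.
TOPICS = {
    "sports": sentence_vector(["game", "match", "score"], EMBEDDINGS),
    "food": sentence_vector(["pasta", "sauce"], EMBEDDINGS),
}

print(closest_topic(["great", "game"], TOPICS, EMBEDDINGS))
print(closest_topic(["great", "pasta"], TOPICS, EMBEDDINGS))
```

    Comparing against topic vectors is only one way to interpret a sentence vector; the same vectors could instead be fed to a classifier.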

    You can also use word embeddings alongside a bidirectional LSTM (BiLSTM). This is the most computationally intensive option, but it might be a good choice if your other options are not working. BiLSTMs interpret a sentence by looking at the context on both sides of each word to work out what the word might mean in that context.

    Hope this helps
