Find longest sequence of common words from list of words in python

£可爱£侵袭症+ 提交于 2020-06-23 05:15:42

问题


I searched a lot for a solution and I indeed have found similar questions. This answer gives back the longest sequence of CHARACTERS that might NOT belong in all of the strings in the input list. This answer gives back the longest common sequences of WORDS that MUST belong to all of the strings in the input list.

I am looking for a combination of the above solutions. That is, I want the longest sequence of common WORDS that might NOT appear in all of the words/phrases of the input list.

Here are some examples of what is expected:

['exterior lighting', 'interior lighting'] --> 'lighting'

['ambient lighting', 'ambient light'] --> 'ambient'

['led turn signal lamp', 'turn signal lamp', 'signal and ambient lamp', 'turn signal light'] --> 'turn signal lamp'

['ambient lighting', 'infrared light'] --> ''

Thank you


回答1:


this code will also sort your desired list by the most common word in your list. it will count the amount of every word in your list, and than will cut the words that appeared only once and sort it.

lst=['led turn signal lamp', 'turn signal lamp', 'signal and ambient lamp', 'turn signal light'] 
d = {}
d_words={}
for i in lst:
    for j in i.split():
      if j in d:
          d[j] = d[j]+1
      else:
          d[j]= 1
for k,v in d.items():
    if v!=1:
        d_words[k] = v
sorted_words = sorted(d_words,key= d_words.get,reverse = True)
print(sorted_words)



回答2:


A rather crude solution but I think it works:

from nltk.util import everygrams
import pandas as pd

def get_word_sequence(phrases):

    ngrams = []

    for phrase in phrases:        
        phrase_split = [ token for token in phrase.split()]
        ngrams.append(list(everygrams(phrase_split)))

    ngrams = [i for j in ngrams for i in j]  # unpack it    

    counts_per_ngram_series = pd.Series(ngrams).value_counts()

    counts_per_ngram_df = pd.DataFrame({'ngram':counts_per_ngram_series.index, 'count':counts_per_ngram_series.values})

    # discard the pandas Series
    del(counts_per_ngram_series)

    # filter out the ngrams that appear only once
    counts_per_ngram_df = counts_per_ngram_df[counts_per_ngram_df['count'] > 1]

    if not counts_per_ngram_df.empty:    
        # populate the ngramsize column
        counts_per_ngram_df['ngramsize'] = counts_per_ngram_df['ngram'].str.len()

        # sort by ngramsize, ngram_char_length and then by count
        counts_per_ngram_df.sort_values(['ngramsize', 'count'], inplace = True, ascending = [False, False])

        # get the top ngram
        top_ngram = " ".join(*counts_per_ngram_df.head(1).ngram.values)

        return top_ngram

    return ''


来源:https://stackoverflow.com/questions/62149929/find-longest-sequence-of-common-words-from-list-of-words-in-python

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!