splitting a list of sentences into separate words in a list

失恋的感觉 2021-01-16 18:33

I have a list that consists of lines, such as

lines = ['The query complexity of estimating weighted averages.',
     'New bounds for the query complexity of a
4 Answers
  •  既然无缘
    2021-01-16 19:05

    You can do it with NLTK's word_tokenize:

    import nltk
    nltk.download('punkt')
    from nltk.tokenize import word_tokenize
    
    lines =  ['The query complexity of estimating weighted averages.',
     'New bounds for the query complexity of an algorithm that learns',
     'DFAs with correction equivalence queries.',
     'general procedure to check conjunctive query containment.']
    
    joint_words = ' '.join(lines)
    
    separated_words = word_tokenize(joint_words)
    
    print(separated_words)
    

    The output will be:

    ['The', 'query', 'complexity', 'of', 'estimating', 'weighted', 'averages', '.', 'New', 'bounds', 'for', 'the', 'query', 'complexity', 'of', 'an', 'algorithm', 'that', 'learns', 'DFAs', 'with', 'correction', 'equivalence', 'queries', '.', 'general', 'procedure', 'to', 'check', 'conjunctive', 'query', 'containment', '.']
    

    In addition, if you want to merge each dot with the preceding word (the dots appear as independent strings in the list), run the following code:

    merged_words = []
    for word in separated_words:
        if word == '.' and merged_words:
            # Attach the dot to the previous word instead of keeping it
            # as a separate token
            merged_words[-1] += word
        else:
            merged_words.append(word)

    print(merged_words)
    

    The output will be:

    ['The', 'query', 'complexity', 'of', 'estimating', 'weighted', 'averages.', 'New', 'bounds', 'for', 'the', 'query', 'complexity', 'of', 'an', 'algorithm', 'that', 'learns', 'DFAs', 'with', 'correction', 'equivalence', 'queries.', 'general', 'procedure', 'to', 'check', 'conjunctive', 'query', 'containment.']
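
    If you don't need NLTK at all, plain str.split() produces this second result directly, since splitting on whitespace leaves the punctuation attached to each word. A minimal sketch (the variable name `words` is illustrative):

    ```python
    lines = ['The query complexity of estimating weighted averages.',
             'New bounds for the query complexity of an algorithm that learns',
             'DFAs with correction equivalence queries.',
             'general procedure to check conjunctive query containment.']

    # Split every line on whitespace and flatten into a single list
    words = [word for line in lines for word in line.split()]

    print(words)
    ```

    Use word_tokenize when you want punctuation as separate tokens; use str.split() when whitespace-delimited words are enough.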
    
