Extract words out of a text file

春和景丽 2020-12-28 20:37

Let's say you have a text file like this one: http://www.gutenberg.org/files/17921/17921-8.txt

Does anyone have a good algorithm, or open-source code, to extract wor

5 answers
  •  天涯浪人
    2020-12-28 21:13

    Pseudocode would look like this:

    create words, a list of words, by splitting the input by whitespace
    for every word, strip out whitespace and punctuation on the left and the right
    

    The Python code would be something like this:

    # `text` holds the full contents of the file as one string
    words = text.split()
    words = [word.strip(PUNCTUATION) for word in words]
    

    where

    PUNCTUATION = ",. \n\t\\\"'][#*:"
    

    or any other characters you want to remove.
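
As a minimal sketch, the two steps above can be wrapped into a small function. The names `extract_words` and `extract_words_from_file` are made up for illustration, and the `latin-1` default encoding is an assumption about the linked file, not something confirmed in the question. Unlike the two-line version, this sketch also drops tokens that become empty after stripping (for example a token made only of `*` characters).

```python
# Characters trimmed from both ends of each token (same set as above).
PUNCTUATION = ",. \n\t\\\"'][#*:"


def extract_words(text, strip_chars=PUNCTUATION):
    """Split text on whitespace, trim strip_chars from both ends of each
    token, and drop tokens that end up empty."""
    stripped = (token.strip(strip_chars) for token in text.split())
    return [word for word in stripped if word]


def extract_words_from_file(path, encoding="latin-1"):
    # Hypothetical helper: the encoding is an assumption, adjust it to
    # match the actual file.
    with open(path, encoding=encoding) as handle:
        return extract_words(handle.read())
```

Note that `strip` only touches the ends of a token, so inner apostrophes and hyphens (as in "Gutenberg's" or "re-use") are kept.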

    I believe Java has equivalent functions in the String class, such as String.split().


    Output of running this code on the text you provided in your link:

    >>> print(words[:100])
    ['Project', "Gutenberg's", 'Manual', 'of', 'Surgery', 'by', 'Alexis', 
    'Thomson', 'and', 'Alexander', 'Miles', 'This', 'eBook', 'is', 'for', 
    'the', 'use', 'of', 'anyone', 'anywhere', 'at', 'no', 'cost', 'and', 
    'with', 'almost', 'no', 'restrictions', 'whatsoever', 'You', 'may', 
    'copy', 'it', 'give', 'it', 'away', 'or', 're-use', 'it', 'under', 
    ... etc etc.
    
