Extract words out of a text file

Posted by 浪子不回头ぞ on 2019-11-29 01:30:51

Question


Let's say you have a text file like this one: http://www.gutenberg.org/files/17921/17921-8.txt

Does anyone have a good algorithm, or open-source code, to extract words from a text file? How can I get all the words while avoiding special characters and keeping things like "it's"?

I'm working in Java. Thanks


Answer 1:


This sounds like the right job for regular expressions. Here is some Java code to give you an idea, in case you don't know how to start:

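import java.util.regex.Matcher;
import java.util.regex.Pattern;

String input = "Input text, with words, punctuation, etc. Well, it's rather short.";
// One or more word characters (letters, digits, underscore) or apostrophes
Pattern p = Pattern.compile("[\\w']+");
Matcher m = p.matcher(input);

while (m.find()) {
    // group() returns the text of the current match
    System.out.println(m.group());
}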

The pattern [\w']+ matches a run of one or more word characters (letters, digits and the underscore) or apostrophes, so the example string is printed word by word and "it's" stays in one piece. Have a look at the Java Pattern class documentation to read more.
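
To run the same pattern over a whole file rather than a string literal, the loop can be wrapped in a method and paired with a small file-reading helper. This is only a sketch: the class name WordExtractor, both method names, and the choice to pass the charset explicitly are assumptions made for illustration and are not part of the original answer.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named for illustration only.
class WordExtractor {
    // Same pattern as in the answer: runs of word characters or apostrophes.
    private static final Pattern WORD = Pattern.compile("[\\w']+");

    /** Returns every match of WORD in the text, in order of appearance. */
    static List<String> extractWords(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    /** Reads the whole file into memory, decodes it with the given charset, and extracts the words. */
    static List<String> extractWordsFromFile(Path file, Charset charset) throws IOException {
        String text = new String(Files.readAllBytes(file), charset);
        return extractWords(text);
    }
}
```

For the file linked in the question, a call such as WordExtractor.extractWordsFromFile(Paths.get("17921-8.txt"), StandardCharsets.ISO_8859_1) should work, assuming the file was saved locally under that name and that it is Latin-1 encoded. Note that \w covers only ASCII letters, digits and the underscore by default; compiling the pattern with the Pattern.UNICODE_CHARACTER_CLASS flag makes it accept accented letters as well.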




Answer 2:


Pseudocode would look like this:

create words, a list of words, by splitting the input by whitespace
for every word, strip out whitespace and punctuation on the left and the right

The Python code would be something like this:

words = input.split()
words = [word.strip(PUNCTUATION) for word in words]

where

PUNCTUATION = ",. \n\t\\\"'][#*:"

or any other characters you want to remove.

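Java has an equivalent for the first step in the String class: String.split(). It has no built-in counterpart to Python's strip(chars), so the trimming step needs a regular expression or a small loop.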


Output of running this code on the text you provided in your link:

>>> print words[:100]
['Project', "Gutenberg's", 'Manual', 'of', 'Surgery', 'by', 'Alexis', 
'Thomson', 'and', 'Alexander', 'Miles', 'This', 'eBook', 'is', 'for', 
'the', 'use', 'of', 'anyone', 'anywhere', 'at', 'no', 'cost', 'and', 
'with', 'almost', 'no', 'restrictions', 'whatsoever', 'You', 'may', 
'copy', 'it', 'give', 'it', 'away', 'or', 're-use', 'it', 'under', 
... etc etc.
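
Since the question is about Java, the split-and-strip steps of this answer can be sketched there as well. The class name SplitAndStrip and the methods strip and words are hypothetical names chosen for this example. Unlike the Python list comprehension, this version drops tokens that become empty after stripping, which is an assumption about what the caller wants.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical class, named for illustration only.
class SplitAndStrip {
    // Same characters as PUNCTUATION in the Python snippet.
    static final String PUNCTUATION = ",. \n\t\\\"'][#*:";

    /** Removes any of the given characters from both ends of the token. */
    static String strip(String token, String chars) {
        int start = 0;
        int end = token.length();
        while (start < end && chars.indexOf(token.charAt(start)) >= 0) {
            start++;
        }
        while (end > start && chars.indexOf(token.charAt(end - 1)) >= 0) {
            end--;
        }
        return token.substring(start, end);
    }

    /** Splits on whitespace, strips punctuation from each token, and skips empty results. */
    static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            String word = strip(token, PUNCTUATION);
            if (!word.isEmpty()) {
                result.add(word);
            }
        }
        return result;
    }
}
```

Because only the ends of each token are stripped, apostrophes and hyphens inside a word survive, which is why "Gutenberg's" and "re-use" appear intact in the output above.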



Answer 3:


Here's one approach to your problem. This function receives your text as input and returns a list of all the words in it. Note that it treats every character that is not a letter or a digit as a separator, so "it's" comes out as "it" and "s".

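import java.util.ArrayList;

private ArrayList<String> get_Words(String SInput){

    ArrayList<String> all_Words_List = new ArrayList<String>();

    // Collects the characters of the word currently being read
    StringBuilder SWord = new StringBuilder();
    for(int i=0; i<SInput.length(); i++){
        char charAt = SInput.charAt(i);
        if(Character.isAlphabetic(charAt) || Character.isDigit(charAt)){
            // Letter or digit: extend the current word
            SWord.append(charAt);
        }
        else{
            // Any other character ends the current word
            if(SWord.length() > 0) all_Words_List.add(SWord.toString());
            SWord.setLength(0);
        }

    }

    // Flush the last word when the text does not end with a separator
    if(SWord.length() > 0) all_Words_List.add(SWord.toString());

    return all_Words_List;

}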



Answer 4:


Basically, you want to match

([A-Za-z])+('([A-Za-z])*)?

right?
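
To see what this pattern does and does not match, it can be tried out in Java as follows. This is a minimal sketch; the class name ApostrophePattern and the method findWords are assumptions introduced for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class, named for illustration only.
class ApostrophePattern {
    // The pattern from this answer, copied verbatim:
    // one or more ASCII letters, optionally followed by an apostrophe and zero or more letters.
    private static final Pattern WORD = Pattern.compile("([A-Za-z])+('([A-Za-z])*)?");

    /** Returns every match of the pattern in the text, in order of appearance. */
    static List<String> findWords(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }
}
```

With this pattern "it's" is kept whole and a leading apostrophe is never part of a match, but a trailing one is (as in "dogs'"), because the letters after the apostrophe are optional. Digits and accented letters are not matched at all, so whether the pattern fits depends on the text.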




Answer 5:


You could try a regex: write a pattern that describes a word, then count the number of times that pattern is found in the text.
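
Counting matches could look roughly like the sketch below. The class name MatchCounter and both method names are hypothetical, and the per-word tally (lower-cased, sorted alphabetically) is an assumption about what kind of count is wanted rather than something the answer specifies.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class, named for illustration only.
class MatchCounter {
    /** Counts how many times the pattern is found in the text. */
    static int countMatches(Pattern pattern, String text) {
        Matcher m = pattern.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    /** Counts each distinct match separately, ignoring case; keys are sorted alphabetically. */
    static Map<String, Integer> countEachWord(Pattern pattern, String text) {
        Map<String, Integer> counts = new TreeMap<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            String word = m.group().toLowerCase(Locale.ROOT);
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }
}
```

Any word pattern can be passed in, for example Pattern.compile("[\\w']+").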



Source: https://stackoverflow.com/questions/276546/extract-words-out-of-a-text-file
