Splitting strings by punctuation and whitespace with regular expressions in Java

感动是毒 2020-12-01 12:28

I have this text file that I read into a Java application and then count the words in it line by line. Right now I am splitting the lines into words by a

St         


        
4 Answers
  • 2020-12-01 12:43

    You have one small mistake in your regex. Try this:

    String[] Res = Text.split("[\\p{Punct}\\s]+");
    

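    In [\\p{Punct}\\s]+ the + has been moved from inside the character class to the outside. Inside the class, the + is only a literal plus sign (which \\p{Punct} already covers), not a quantifier, so consecutive delimiters such as a period followed by a space are not combined into one split and you get empty strings in the result.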

    So, for this code

    String Text = "But I know. For example, the word \"can\'t\" should";
    
    String[] Res = Text.split("[\\p{Punct}\\s]+");
    System.out.println(Res.length);
    for (String s:Res){
        System.out.println(s);
    }
    

    I get this result:

    10
    But
    I
    know
    For
    example
    the
    word
    can
    t
    should

    That should meet your requirement.

    As an alternative you can use

    String[] Res = Text.split("\\P{L}+");
    

    \\P{L} matches any Unicode code point that does not have the property "Letter", so the pattern splits on runs of non-letters.
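To make the effect of the quantifier placement concrete, here is a minimal sketch comparing the patterns. The original question is truncated, so the "plus inside the class" pattern below is an assumption about what the asker wrote; the class and method names are hypothetical.

```java
import java.util.Arrays;

public class QuantifierPlacement {
    // Assumed original pattern: '+' inside the class is a literal character,
    // so every single delimiter splits on its own.
    static String[] splitPlusInside(String text) {
        return text.split("[\\p{Punct}\\s+]");
    }

    // Corrected pattern: '+' outside the class is a quantifier,
    // so a run of delimiters is treated as one separator.
    static String[] splitPlusOutside(String text) {
        return text.split("[\\p{Punct}\\s]+");
    }

    // Alternative from the answer: split on runs of non-letters.
    static String[] splitOnNonLetters(String text) {
        return text.split("\\P{L}+");
    }

    public static void main(String[] args) {
        String text = "I know. For example";
        // The ". " pair yields an empty string between "know" and "For".
        System.out.println(Arrays.toString(splitPlusInside(text)));
        System.out.println(Arrays.toString(splitPlusOutside(text)));
        // Digits count as non-letters, accented letters count as letters.
        System.out.println(Arrays.toString(splitOnNonLetters("caf\u00E9 no. 42 opens")));
    }
}
```

With the quantifier inside the class, the period and the following space each produce a split, which leaves an empty token that would inflate a word count.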

  • 2020-12-01 12:44

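    Well, seeing that you want to count can't as two words, try matching the words themselves instead of the delimiters:

    Pattern.compile("\\b\\w+\\b").matcher(line)

    Count one word per successful find(). Note that passing this pattern to split() would remove the words and return the delimiters instead.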
    

    http://www.regular-expressions.info/wordboundaries.html
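A word-boundary pattern matches the words rather than the separators, so one way to use it for counting is with Matcher.find(). The following is only a sketch; the class name BoundaryCount and the helper countWords are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BoundaryCount {
    // Matches a run of word characters delimited by word boundaries.
    private static final Pattern WORD = Pattern.compile("\\b\\w+\\b");

    // Counts one word per successful match in the given line.
    static int countWords(String line) {
        Matcher matcher = WORD.matcher(line);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // "can't" contributes two matches: "can" and "t".
        System.out.println(countWords("For example, the word \"can't\" should"));
    }
}
```

Because nothing is split, this approach never produces empty tokens, and an empty line simply yields a count of zero.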

  • 2020-12-01 12:48

    Try:

    line.split("[\\.,\\s!;?:\"]+");
    // or, to split on apostrophes as well:
    line.split("[\\.,\\s!;?:\"']+");
    

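    This character class matches any one of the characters . , ! ; ? : " ' or any whitespace character (\\s). Note that the backslashes are only escapes; the class contains no literal / or \. The + causes several delimiter characters in a row to be treated as a single separator.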

    That should give you mostly sufficient accuracy. A more precise regex would need more information about the kind of text you need to parse, because ' can be a word delimiter as well. Most punctuation that delimits words sits next to whitespace, so matching on [\\s]+ alone would be a close approximation too, but it gives the wrong count on short quotations like: She said:"no".
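The difference between the whitespace-only approximation and the explicit delimiter class can be checked on the quotation mentioned above. This is a small sketch; the class and method names are hypothetical.

```java
public class DelimiterComparison {
    // Approximation: split on whitespace only.
    static int countByWhitespace(String line) {
        return line.split("\\s+").length;
    }

    // Explicit class of punctuation and whitespace delimiters from the answer.
    static int countByDelimiterClass(String line) {
        return line.split("[\\.,\\s!;?:\"']+").length;
    }

    public static void main(String[] args) {
        String line = "She said:\"no\".";
        // Whitespace only: "She" and "said:\"no\"." are the only tokens.
        System.out.println(countByWhitespace(line));
        // Delimiter class: "She", "said" and "no".
        System.out.println(countByDelimiterClass(line));
    }
}
```

The whitespace-only split undercounts here because the colon and the quotes glue two words into one token.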

  • 2020-12-01 12:57

    There's a predefined non-word character class, \W; see the Pattern documentation.

    String line = "Hello! this is a line. It can't be hard to split into \"words\", can it?";
    String[] words = line.split("\\W+");
    for (String word : words) System.out.println(word);
    

    gives

    Hello
    this
    is
    a
    line
    It
    can
    t
    be
    hard
    to
    split
    into
    words
    can
    it
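One caveat when using the array length as a word count: if a line starts with a delimiter, split returns an empty first element, and an empty line yields a single empty string. A minimal sketch of a counting helper that filters those out follows; SafeWordCount and countWords are hypothetical names, not part of the answer.

```java
import java.util.Arrays;

public class SafeWordCount {
    // Splits on runs of non-word characters and ignores empty tokens,
    // which appear for lines that start with a delimiter or are empty.
    static long countWords(String line) {
        return Arrays.stream(line.split("\\W+"))
                .filter(token -> !token.isEmpty())
                .count();
    }

    public static void main(String[] args) {
        String line = "\"Quoted\" start";
        // The raw split has a leading empty string before "Quoted".
        System.out.println(line.split("\\W+").length);
        System.out.println(countWords(line));
        System.out.println(countWords(""));
    }
}
```

Since the question counts words line by line, a helper like this could be called once per line and the results summed.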
    