I have this text file that I read into a Java application and then count the words in it line by line. Right now I am splitting the lines into words by a
St
You have one small mistake in your regex. Try this:
String[] Res = Text.split("[\\p{Punct}\\s]+");
That is, move the + from inside the character class to the outside, giving [\\p{Punct}\\s]+. Otherwise you also split on a literal +, and consecutive delimiter characters are not combined into a single split.
So for this code:
String Text = "But I know. For example, the word \"can\'t\" should";
String[] Res = Text.split("[\\p{Punct}\\s]+");
System.out.println(Res.length);
for (String s : Res) {
    System.out.println(s);
}
I get this result:
10
But
I
know
For
example
the
word
can
t
should
This should meet your requirement.
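To make the effect of the + placement concrete, here is a minimal sketch. It assumes the original pattern was [\\p{Punct}\\s+] (the regex in the question is cut off, so this is only a guess based on the advice above), and the class and method names are made up for the illustration.

```java
import java.util.Arrays;

class PlusPlacementDemo {
    // Assumed original pattern: the + is inside the class, so it is just a literal '+'
    // and every delimiter character produces its own split.
    static final String PLUS_INSIDE = "[\\p{Punct}\\s+]";
    // Pattern from the answer: the + quantifies the whole class,
    // so a run of delimiters counts as a single split.
    static final String PLUS_OUTSIDE = "[\\p{Punct}\\s]+";

    static String[] split(String text, String regex) {
        return text.split(regex);
    }

    public static void main(String[] args) {
        String text = "example, the word";
        // The comma and the following space are split separately,
        // which leaves an empty string between them.
        System.out.println(Arrays.toString(split(text, PLUS_INSIDE)));  // [example, , the, word]
        System.out.println(Arrays.toString(split(text, PLUS_OUTSIDE))); // [example, the, word]
    }
}
```

With the + inside the class, the empty strings between adjacent delimiters inflate the word count, which is probably the symptom the question was about.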
As an alternative you can use
String[] Res = Text.split("\\P{L}+");
\\P{L} matches any Unicode code point that does not have the property "Letter".
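A small sketch of where \\P{L} differs from \\W in practice, assuming the text may contain accented letters. The sample string and the class and method names are invented for the example; non-ASCII characters are written as Unicode escapes so the source file encoding does not matter. Note that \\P{L} also treats digits and underscores as separators.

```java
import java.util.Arrays;

class LetterSplitDemo {
    // Splits on runs of anything that is not a Unicode letter.
    static String[] splitOnNonLetters(String text) {
        return text.split("\\P{L}+");
    }

    // Splits on runs of non-word characters; by default \W in Java is
    // ASCII-only, so accented letters count as separators here.
    static String[] splitOnNonWordChars(String text) {
        return text.split("\\W+");
    }

    public static void main(String[] args) {
        String text = "na\u00EFve caf\u00E9"; // "naive cafe" with accented i and e
        System.out.println(Arrays.toString(splitOnNonLetters(text)));   // two words
        System.out.println(Arrays.toString(splitOnNonWordChars(text))); // three fragments: na, ve, caf
    }
}
```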
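Well, seeing that you want to count can't as two words, try matching on word boundaries:
\\b\\w+?\\b
Note that this pattern matches the words themselves, so count its matches (for example with Matcher.find()) rather than passing it to split(), which would return the text between the words.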
http://www.regular-expressions.info/wordboundaries.html
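Since a word-boundary pattern describes the words rather than the separators, one way to use it for counting is to iterate over the matches. This is only a sketch: the class and method names are hypothetical, and it uses the greedy \\w+ instead of \\w+?, which finds the same words here because each match has to end at a word boundary anyway.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordBoundaryCount {
    // A word is a run of word characters delimited by word boundaries.
    private static final Pattern WORD = Pattern.compile("\\b\\w+\\b");

    // Counts the matches of WORD in the given line.
    static int countWords(String line) {
        Matcher m = WORD.matcher(line);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "But I know. For example, the word \"can't\" should";
        System.out.println(countWords(text)); // 10, with can't counted as "can" and "t"
    }
}
```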
Try:
line.split("[\\.,\\s!;?:\"]+");
or "[\\.,\\s!;?:\"']+"
The character class matches any one of these characters: . , ! ; ? : " ' or whitespace (note that neither / nor \ is a delimiter; the backslashes in the pattern are only escapes). The + causes several consecutive delimiters to be counted as one.
That should be accurate enough for most text.
More precise regexes would need more information about the type of text you need to parse, because ' can be a word delimiter as well. Most punctuation that delimits words sits next to whitespace, so matching on [\\s]+ alone would be a close approximation as well (but it gives the wrong count on short quotations like: She said:"no".).
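The quotation case mentioned above can be checked with a short sketch; the class and method names are made up for the example, and the two patterns are the ones from this answer.

```java
class QuotationCountDemo {
    // Whitespace-only approximation.
    static int countByWhitespace(String line) {
        return line.split("[\\s]+").length;
    }

    // Punctuation-aware pattern, including the apostrophe.
    static int countByPunctuation(String line) {
        return line.split("[\\.,\\s!;?:\"']+").length;
    }

    public static void main(String[] args) {
        String line = "She said:\"no\".";
        // No whitespace separates said from no, so they stay glued together.
        System.out.println(countByWhitespace(line));  // 2
        System.out.println(countByPunctuation(line)); // 3
    }
}
```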
There's a predefined non-word character class, \W; see the documentation for java.util.regex.Pattern.
String line = "Hello! this is a line. It can't be hard to split into \"words\", can it?";
String[] words = line.split("\\W+");
for (String word : words) System.out.println(word);
gives
Hello
this
is
a
line
It
can
t
be
hard
to
split
into
words
can
it
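One caveat when turning this into a line-by-line count: split() returns a leading empty string when the line starts with a delimiter, and a single empty string for an empty line, so using the array length directly can overcount. Below is a minimal sketch that skips empty tokens. The class and method names are hypothetical, and the lines are passed in as a list rather than read from a file, to keep the example self-contained.

```java
import java.util.Arrays;
import java.util.List;

class LineWordCounter {
    // Counts the non-empty tokens produced by splitting on non-word characters.
    static int countWordsInLine(String line) {
        int count = 0;
        for (String token : line.split("\\W+")) {
            if (!token.isEmpty()) {
                count++;
            }
        }
        return count;
    }

    // Sums the per-line counts.
    static int countWordsInLines(List<String> lines) {
        int total = 0;
        for (String line : lines) {
            total += countWordsInLine(line);
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "Hello! this is a line.",
                "",
                "\"words\", can it?");
        System.out.println(countWordsInLines(lines)); // 5 + 0 + 3 = 8
    }
}
```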