How to split a sentence into words and punctuation in Java

六月ゝ 毕业季﹏ submitted on 2021-01-29 06:52:38

Question


I want to split a given sentence (a String) into words, and I also want the punctuation to be included in the resulting list.

For example, if the sentence is: "Sara's dog 'bit' the neighbor."
I want the output to be: [Sara's, dog, ', bit, ', the, neighbor, .]

With string.split(" ") I can split the sentence into words on spaces, but I want the punctuation to be in the result list as well.

    String text = "Sara's dog 'bit' the neighbor.";
    String[] list = text.split(" ");

The printed result is [Sara's, dog, 'bit', the, neighbor.]
I don't know how to combine another regex with the split method above so that punctuation is separated as well.

Some references I have already tried, but which didn't work out:

1. Splitting strings through regular expressions by punctuation and whitespace etc in java

2. How to split sentence to words and punctuation using split or matcher?

Example inputs and outputs

String input1 = "Holy cow! screamed Jane.";

String[] output1 = {"Holy", "cow", "!", "screamed", "Jane", "."};

String input2 = "Select your 'pizza' topping {pepper and tomato} follow me.";

String[] output2 = {"Select", "your", "'", "pizza", "'", "topping", "{", "pepper", "and", "tomato", "}", "follow", "me", "."};

Answer 1:


Instead of trying to come up with a pattern to split on, this challenge is easier to solve by coming up with a pattern of the elements to capture.

Although it's more code than a simple split(), it can still be done in a single statement in Java 9+:

String regex = "[\\p{L}\\p{M}\\p{N}]+(?:\\p{P}[\\p{L}\\p{M}\\p{N}]+)*|[\\p{P}\\p{S}]";
String[] parts = Pattern.compile(regex).matcher(s).results().map(MatchResult::group).toArray(String[]::new);

In Java 8 or earlier, you would write it like this:

List<String> parts = new ArrayList<>();
Matcher m = Pattern.compile(regex).matcher(s);
while (m.find()) {
    parts.add(m.group());
}

Explanation

\p{L} is Unicode letters, \p{N} is Unicode numbers, and \p{M} is Unicode marks (e.g. accents). Combined, they are treated here as the characters of a "word".

\p{P} is Unicode punctuation. A "word" can have single punctuation characters embedded inside the word. The pattern before | matches a "word", given that definition.

\p{S} is Unicode symbol. Punctuation that is not embedded inside a "word", and symbols, are matched individually. That is the pattern after the |.

That leaves Unicode categories Z (separator) and C (other) uncovered, which means that any such character is skipped.
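To make the explanation above more concrete, here is a small sketch that runs the same pattern over a few edge cases: a hyphenated word, doubled punctuation, and a currency symbol. The class name PatternEdgeCases and the tokenize helper are hypothetical names chosen for this illustration, and the loop uses Matcher.find() so that it also compiles on Java 8.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the name and the helper method are illustrative only.
class PatternEdgeCases {
    // Same pattern as in the answer: a "word" with optional single embedded
    // punctuation characters, or one standalone punctuation/symbol character.
    static final Pattern TOKEN = Pattern.compile(
            "[\\p{L}\\p{M}\\p{N}]+(?:\\p{P}[\\p{L}\\p{M}\\p{N}]+)*|[\\p{P}\\p{S}]");

    static List<String> tokenize(String s) {
        List<String> parts = new ArrayList<>();
        Matcher m = TOKEN.matcher(s);
        while (m.find()) {
            parts.add(m.group());
        }
        return parts;
    }

    public static void main(String[] args) {
        // A single embedded punctuation character stays inside the word.
        System.out.println(tokenize("well-known fact"));  // [well-known, fact]
        // Two punctuation characters in a row are not "embedded", so they split.
        System.out.println(tokenize("a--b"));             // [a, -, -, b]
        // Symbols (\p{S}) such as $ and + become tokens of their own.
        System.out.println(tokenize("$5.50 + tax"));      // [$, 5.50, +, tax]
    }
}
```

Note that the "embedded punctuation" rule is what keeps Sara's together, and it equally keeps 5.50 and well-known together. Whether that is desirable depends on the input.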

Test

import java.util.Arrays;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;

public class Test {
    public static void main(String[] args) {
        test("Sara's dog 'bit' the neighbor.");
        test("Holy cow! screamed Jane.");
        test("Select your 'pizza' topping {pepper and tomato} follow me.");
    }
    private static void test(String s) {
        String regex = "[\\p{L}\\p{M}\\p{N}]+(?:\\p{P}[\\p{L}\\p{M}\\p{N}]+)*|[\\p{P}\\p{S}]";
        String[] parts = Pattern.compile(regex).matcher(s).results().map(MatchResult::group).toArray(String[]::new);
        System.out.println(Arrays.toString(parts));
    }
}

Output

[Sara's, dog, ', bit, ', the, neighbor, .]
[Holy, cow, !, screamed, Jane, .]
[Select, your, ', pizza, ', topping, {, pepper, and, tomato, }, follow, me, .]



Answer 2:


List<String> parts = Arrays.stream(s.split("((?<=[\\s\\p{Punct}])|(?=[\\s\\p{Punct}]))"))
        .filter(ss -> !ss.trim().isEmpty())
        .collect(Collectors.toList());

Reference:

How to split a string, but also keep the delimiters?

Regular Expressions on Punctuation
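
As a rough check of how this lookaround split behaves, the sketch below wraps the expression from this answer in a method and runs it on two of the question's inputs. The class name LookaroundSplitDemo and the method name split are hypothetical. One thing it shows: because the split happens before and after every \p{Punct} character (ASCII punctuation by default in Java), the apostrophe in Sara's is also split off, which differs from the output the question asks for.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical demo class wrapping the expression from this answer.
class LookaroundSplitDemo {
    static List<String> split(String s) {
        // Split at every position that is preceded or followed by whitespace
        // or punctuation, then drop the whitespace-only pieces.
        return Arrays.stream(s.split("((?<=[\\s\\p{Punct}])|(?=[\\s\\p{Punct}]))"))
                .filter(ss -> !ss.trim().isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(split("Holy cow! screamed Jane."));
        // [Holy, cow, !, screamed, Jane, .]
        System.out.println(split("Sara's dog 'bit' the neighbor."));
        // [Sara, ', s, dog, ', bit, ', the, neighbor, .]
    }
}
```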




Answer 3:


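List<String> chars = new ArrayList<>();
String str = "Hello my name is bob";
String tempStr = "";
for (char cha : str.toCharArray()) {
  if (cha == ' ') {
    // A space ends the current word
    if (!tempStr.isEmpty()) {
      chars.add(tempStr);
      tempStr = "";
    }
  }
  // Add whatever punctuation characters you want to handle here
  else if (cha == '!' || cha == '.') {
    // Flush the word before the punctuation, then add the punctuation itself
    if (!tempStr.isEmpty()) {
      chars.add(tempStr);
      tempStr = "";
    }
    chars.add(String.valueOf(cha));
  }
  else {
    tempStr = tempStr + cha;
  }
}
// Add the last word if the sentence did not end with a space or punctuation
if (!tempStr.isEmpty()) {
  chars.add(tempStr);
}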

Something like that? It should add every word, assuming the words in the sentence are separated by spaces. For characters such as ! and ., you need a separate check as well. Quite simple.
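
If listing each punctuation character by hand becomes tedious, one possible generalization of the loop idea is to treat every character that is neither a letter, a digit, nor whitespace as its own token. This is only a sketch under that assumption: the class name CharLoopTokenizer and the method tokenize are hypothetical, it works on individual char values (so it does not handle characters outside the Basic Multilingual Plane), and it splits the apostrophe out of Sara's, unlike the output requested in the question.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: a character loop without a hard-coded punctuation list.
class CharLoopTokenizer {
    static List<String> tokenize(String str) {
        List<String> tokens = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        for (char c : str.toCharArray()) {
            if (Character.isLetterOrDigit(c)) {
                // Letters and digits extend the current word.
                word.append(c);
                continue;
            }
            // Any other character ends the current word.
            if (word.length() > 0) {
                tokens.add(word.toString());
                word.setLength(0);
            }
            // Whitespace is dropped; everything else becomes its own token.
            if (!Character.isWhitespace(c)) {
                tokens.add(String.valueOf(c));
            }
        }
        // Add the last word if the input did not end with a separator.
        if (word.length() > 0) {
            tokens.add(word.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Select your 'pizza' topping {pepper and tomato} follow me."));
        // [Select, your, ', pizza, ', topping, {, pepper, and, tomato, }, follow, me, .]
    }
}
```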



Source: https://stackoverflow.com/questions/57858033/how-to-split-a-sentence-into-words-and-punctuations-in-java
