Split a string in Java into equal-length substrings while maintaining word boundaries

生来不讨喜 2020-12-03 09:06

How to split a string into equal parts of maximum character length while maintaining word boundaries?

Say, for example, if I want to split a string "hello world" i

2 answers
  •  再見小時候
    2020-12-03 09:35

    If I understand your problem correctly, this code should do what you need (it assumes that maxLenght is equal to or greater than the length of the longest word):

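    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    String data = "Hello there, my name is not importnant right now."
            + " I am just simple sentecne used to test few things.";
    int maxLenght = 10; // maximum length of each printed chunk
    // \G anchors every match at the end of the previous one
    Pattern p = Pattern.compile("\\G\\s*(.{1,"+maxLenght+"})(?=\\s|$)", Pattern.DOTALL);
    Matcher m = p.matcher(data);
    while (m.find()) {
        System.out.println(m.group(1)); // group 1 excludes the leading whitespace
    }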
    

    Output:

    Hello
    there, my
    name is
    not
    importnant
    right now.
    I am just
    simple
    sentecne
    used to
    test few
    things.
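
    As a rough sketch, the same loop can be wrapped in a reusable method that returns the chunks instead of printing them. The class name WordWrapSketch and the method name splitAtWordBoundaries are hypothetical (they are not part of the original answer), and the method relies on the same assumption that no word is longer than the limit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordWrapSketch {

    // Hypothetical helper: returns chunks of at most maxLength characters,
    // breaking only at whitespace. Assumes no word is longer than maxLength.
    static List<String> splitAtWordBoundaries(String text, int maxLength) {
        Pattern p = Pattern.compile(
                "\\G\\s*(.{1," + maxLength + "})(?=\\s|$)", Pattern.DOTALL);
        Matcher m = p.matcher(text);
        List<String> chunks = new ArrayList<>();
        while (m.find()) {
            chunks.add(m.group(1)); // group 1 skips the leading whitespace
        }
        return chunks;
    }

    public static void main(String[] args) {
        String text = "hello world of regular expressions";
        for (String chunk : splitAtWordBoundaries(text, 11)) {
            System.out.println(chunk);
        }
    }
}
```

    With a limit of 11 this prints "hello world", "of regular" and "expressions" on separate lines.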
    

    Short (or not) explanation of the "\\G\\s*(.{1,"+maxLenght+"})(?=\\s|$)" regex:

    (Let's just remember that in Java \ is special not only in regex but also in String literals, so to use a predefined character class like \d we need to write it as "\\d", because the \ itself has to be escaped in the string literal.)
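
    A small illustration of that double escaping; the class name EscapeSketch and the constant DIGITS are made up for this example:

```java
class EscapeSketch {

    // The Java literal "\\d+" holds three characters: \ d +
    static final String DIGITS = "\\d+";

    public static void main(String[] args) {
        System.out.println(DIGITS);                 // prints \d+
        System.out.println(DIGITS.length());        // prints 3
        System.out.println("2020".matches(DIGITS)); // prints true
    }
}
```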

    • \G - an anchor representing the end of the previous match or, if there is no match yet (when we have just started searching), the beginning of the string (the same as ^ does)
    • \s* - represents zero or more whitespace characters (\s represents whitespace, * is the "zero-or-more" quantifier)
    • (.{1,"+maxLenght+"}) - let's split it into smaller parts (at runtime maxLenght will hold some numeric value like 10, so the regex engine will see it as .{1,10})
      • . represents any character (by default it matches any character except line separators like \n or \r, but thanks to the Pattern.DOTALL flag it can now match any character; you may drop this flag if you want each line of the input to be split separately, since the start of each line will be printed on a new line anyway)
      • {1,10} - a quantifier that lets the previously described element appear 1 to 10 times (by default it is greedy, so it tries to match as many repetitions as possible)
      • .{1,10} - so, based on the above, it simply represents "1 to 10 of any characters"
      • ( ) - parentheses create groups, structures that let us capture specific parts of the match (here we placed the parentheses after \\s* because we want to use only the part after the whitespace)
    • (?=\\s|$) - a look-ahead that makes sure the text matched by .{1,10} is followed by:

      • whitespace (\\s)

        OR (written as |)

      • the end of the string ($).

    So thanks to .{1,10} we can match up to 10 characters, but with (?=\\s|$) after it we require that the last character matched by .{1,10} is not part of an unfinished word (it must be followed by whitespace or the end of the string).
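
    The look-ahead and the \G anchor together also explain the assumption about the longest word. The sketch below compares three regexes on a text containing a word longer than the limit: the regex without \G, the regex from this answer, and a hypothetical variant that is not part of the answer and falls back to cutting exactly max non-whitespace characters when no break at whitespace is possible. All class and method names are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LongWordSketch {

    // Collects group 1 of every match that find() reports for the regex.
    static List<String> collect(String regex, String text) {
        Matcher m = Pattern.compile(regex, Pattern.DOTALL).matcher(text);
        List<String> chunks = new ArrayList<>();
        while (m.find()) {
            chunks.add(m.group(1));
        }
        return chunks;
    }

    // Without \G, find() may skip ahead and start a match in the middle of a word.
    static List<String> withoutAnchor(String text, int max) {
        return collect("\\s*(.{1," + max + "})(?=\\s|$)", text);
    }

    // The regex from the answer: matching stops at the first over-long word.
    static List<String> original(String text, int max) {
        return collect("\\G\\s*(.{1," + max + "})(?=\\s|$)", text);
    }

    // Hypothetical variant: hard-cut max non-whitespace characters as a fallback.
    static List<String> withHardCut(String text, int max) {
        return collect("\\G\\s*(.{1," + max + "}(?=\\s|$)|\\S{" + max + "})", text);
    }

    public static void main(String[] args) {
        String text = "a verylongword b";
        System.out.println(withoutAnchor(text, 5)); // [a, gword, b]
        System.out.println(original(text, 5));      // [a]
        System.out.println(withHardCut(text, 5));   // [a, veryl, ongwo, rd b]
    }
}
```

    Note that the original regex silently drops everything from the over-long word onward, so it may be worth checking that the last match ended at the end of the input.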
