Word Wrapping with Regular Expressions

Submitted by 末鹿安然 on 2019-11-27 02:19:03

Question


EDIT FOR CLARITY - I know there are ways to do this in multiple steps, or using LINQ or vanilla C# string manipulation. I am using a single regex call because I want practice with complex regex patterns. - END EDIT

I am trying to write a single regular expression that will perform word wrapping. It's extremely close to the desired output, but I can't quite get it to work.

Regex.Replace(text, @"(?<=^|\G)(.{1,20}(\s|$))", "$1\r\n", RegexOptions.Multiline)

This is correctly wrapping words for lines that are too long, but it's adding a line break when there already is one.

Input

"This string is really long. There are a lot of words in it.\r\nHere's another line in the string that's also very long."

Expected Output

"This string is \r\nreally long. There \r\nare a lot of words \r\nin it.\r\nHere's another line \r\nin the string that's \r\nalso very long."

Actual Output

"This string is \r\nreally long. There \r\nare a lot of words \r\nin it.\r\n\r\nHere's another line \r\nin the string that's \r\nalso very long.\r\n"

Note the double "\r\n" between sentences where the input already had a line break and the extra "\r\n" that was put at the end.

Perhaps there's a way to conditionally apply different replacement patterns? That is, if the match ends in "\r\n", use the replacement pattern "$1"; otherwise, use the replacement pattern "$1\r\n".
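
The conditional idea in the previous paragraph can be sketched with a replacement callback, which decides per match whether to append a break. The sketch is in JavaScript rather than C#; in .NET, the Regex.Replace overload that takes a MatchEvaluator plays the same role. The name wrapConditionally is made up for illustration, and [^\n] stands in for "." because JavaScript's "." does not match "\r", unlike .NET's.

```javascript
// Hypothetical helper: wrap at `width` columns, appending "\r\n" only where
// the matched chunk does not already end a line.
function wrapConditionally(text, width) {
  // Up to `width` non-newline characters, then one whitespace character or end of input.
  const chunk = new RegExp("[^\\n]{1," + width + "}(?:\\s|$)", "g");
  return text.replace(chunk, (match, offset, whole) => {
    const endsWithBreak = match.endsWith("\n");
    const atEndOfText = offset + match.length === whole.length;
    return endsWithBreak || atEndOfText ? match : match + "\r\n";
  });
}

const input =
  "This string is really long. There are a lot of words in it.\r\n" +
  "Here's another line in the string that's also very long.";
console.log(JSON.stringify(wrapConditionally(input, 20)));
```

For the sample input from the question, this produces the expected output shown above: no doubled break after "in it." and no trailing break. Words longer than the width are left untouched, as in the original pattern.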

Here's a link to a similar question, about wrapping a string with no white space, that I used as a starting point: "Regular expression to find unbroken text and insert space".


Answer 1:


This was quick-tested in Perl.

Edit - This regex code simulates the word wrap used (good or bad) in MS-Windows Notepad.exe

 # MS-Windows  "Notepad.exe Word Wrap" simulation
 # ( N = 16 )
 # ============================
 # Find:     @"(?:((?>.{1,16}(?:(?<=[^\S\r\n])[^\S\r\n]?|(?=\r?\n)|$|[^\S\r\n]))|.{1,16})(?:\r?\n)?|(?:\r?\n|$))"
 # Replace:  @"$1\r\n"
 # Flags:    Global     

 # Note - Through trial and error, it appears Notepad accepts an extra whitespace
 # (possibly in the N+1 position) to help alignment. This does not matter, because its viewport hides it.
 # There is no trimming of any whitespace, so the wrapped buffer could be reconstituted by inserting/detecting a
 # wrap point code which is different than a linebreak.
 # This regex works on un-wrapped source, but could probably be adjusted to produce/work on wrapped buffer text.
 # To reconstitute the source all that is needed is to remove the wrap code which is probably just an extra "\r".

 (?:
      # -- Words/Characters 
      (                       # (1 start)
           (?>                     # Atomic Group - Match words with valid breaks
                .{1,16}                 #  1-N characters
                                        #  Followed by one of 4 prioritized, non-linebreak whitespace
                (?:                     #  break types:
                     (?<= [^\S\r\n] )        # 1. - Behind a non-linebreak whitespace
                     [^\S\r\n]?              #      ( optionally accept an extra non-linebreak whitespace )
                  |  (?= \r? \n )            # 2. - Ahead a linebreak
                  |  $                       # 3. - EOS
                  |  [^\S\r\n]               # 4. - Accept an extra non-linebreak whitespace
                )
           )                       # End atomic group
        |  
           .{1,16}                 # No valid word breaks, just break on the N'th character
      )                       # (1 end)
      (?: \r? \n )?           # Optional linebreak after Words/Characters
   |  
      # -- Or, Linebreak
      (?: \r? \n | $ )        # Stand alone linebreak or at EOS
 )
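
The note about reconstituting the source can be illustrated with a much simpler wrapper than the regex above. The JavaScript sketch below is not the Notepad simulation. It only assumes, as the note guesses, that a soft wrap point is written as a line break with an extra "\r" ("\r\r\n"), and that the source text never contains that sequence. The names SOFT_BREAK, softWrap, and unwrap are hypothetical.

```javascript
// Assumed soft-wrap marker: a normal line break with one extra "\r" in front.
const SOFT_BREAK = "\r\r\n";

// Simplified wrapper: no whitespace is trimmed, so nothing is lost.
function softWrap(text, width) {
  const chunk = new RegExp(
    // Up to `width` characters ending after a space/tab or before a line break...
    "[^\\r\\n]{1," + width + "}(?:(?<=[^\\S\\r\\n])|(?=\\r?\\n|$))" +
    // ...or, with no break point available, a hard cut at `width`.
    "|[^\\r\\n]{1," + width + "}",
    "g"
  );
  return text.replace(chunk, (match, offset, whole) => {
    const next = whole.charAt(offset + match.length);
    // A hard line break (or the end of the text) already follows: add nothing.
    return next === "" || next === "\r" || next === "\n" ? match : match + SOFT_BREAK;
  });
}

// Removing every soft-wrap marker restores the original text.
function unwrap(wrapped) {
  return wrapped.split(SOFT_BREAK).join("");
}

const source = "aaa bbb ccc\r\ndddddddddd";
const wrapped = softWrap(source, 5);
console.log(JSON.stringify(wrapped));
console.log(unwrap(wrapped) === source); // true
```

Because the marker differs from a real line break, the original line breaks survive the round trip unchanged.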

Test Case: the wrap width N is 16. The output matches Notepad's at this width and at a variety of other widths.

 $/ = undef;

 $string1 = <DATA>;

 $string1 =~ s/(?:((?>.{1,16}(?:(?<=[^\S\r\n])[^\S\r\n]?|(?=\r?\n)|$|[^\S\r\n]))|.{1,16})(?:\r?\n)?|(?:\r?\n|$))/$1\r\n/g;

 print $string1;

 __DATA__
 hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh
 bbbbbbbbbbbbbbbbEDIT FOR CLARITY - I                    know there are  ways to do this in   multiple steps, or using LINQ or vanilla C#
 string manipulation. 

 The reason I am using a single regex call, is because I wanted practice. with complex
 regex patterns. - END EDIT
 pppppppppppppppppppUf

Output >>

 hhhhhhhhhhhhhhhh
 hhhhhhhhhhhhhhh
 bbbbbbbbbbbbbbbb
 EDIT FOR CLARITY 
 - I              
       know there 
 are  ways to do 
 this in   
 multiple steps, 
 or using LINQ or 
 vanilla C#
 string 
 manipulation. 

 The reason I am 
 using a single 
 regex call, is 
 because I wanted 
 practice. with 
 complex
 regex patterns. 
 - END EDIT
 pppppppppppppppp
 pppUf



Answer 2:


I would write an extension method like this.

var input = "This string is really long. There are a lot of words in it.\r\nHere's another line in the string that's also very long.";

var lines = input.SplitByLength(20).ToList();

public static partial class MyExtensions
{
    public static  IEnumerable<string> SplitByLength(this string input, int maxLen)
    {
        return Regex.Split(input, @"(.{1," + maxLen + @"})(?:\s|$)")
                    .Where(x => x.Length > 0)
                    .Select(x => x.Trim());
    }
}

OUTPUT

This string is
really long. There
are a lot of words
in it.
Here's another line
in the string that's
also very long.



Answer 3:


Add a placeholder instead of the "\r\n" in the first pass. In a second pass, replace every "\r\n" that is followed by the placeholder with a plain "\r\n". In a third pass, replace the leftover placeholders with "\r\n".

For example, using \u0000 as the placeholder:

This of course only works if your original strings don't contain a null character.

    string text = "This string is really long. There are a lot of words in it.\r\nHere's another line in the string that's also very long.";
    Console.WriteLine(text);

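    // Pass 1: mark every wrap point with the placeholder.
    text = Regex.Replace(text, @"(?<=^|\G)(.{1,20}(\s|$))", "$1\u0000", RegexOptions.Multiline);
    // Pass 2: where the placeholder follows an original line break, drop the placeholder.
    text = Regex.Replace(text, "\r\n\u0000", "\r\n", RegexOptions.Multiline);
    // Pass 3: turn the remaining placeholders into line breaks.
    text = Regex.Replace(text, "\u0000", "\r\n", RegexOptions.Multiline);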
    Console.WriteLine(text);
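
The three passes can also be sketched in JavaScript. This is a port made under a few assumptions, not the answer's code: [^\n] replaces "." because JavaScript's "." does not match "\r", \G is dropped because JavaScript has no equivalent, and one extra step (not in the answer) drops the placeholder left at the very end of the text. The name wrapWithPlaceholder is hypothetical.

```javascript
const PLACEHOLDER = "\u0000"; // must not occur in the input

function wrapWithPlaceholder(text, width) {
  const chunk = new RegExp("([^\\n]{1," + width + "}(?:\\s|$))", "g");
  return text
    .replace(chunk, "$1" + PLACEHOLDER)        // pass 1: mark every wrap point
    .split("\r\n" + PLACEHOLDER).join("\r\n")  // pass 2: a break was already there
    .replace(/\u0000$/, "")                    // extra step: no break at end of text
    .split(PLACEHOLDER).join("\r\n");          // pass 3: remaining marks become breaks
}

const sample =
  "This string is really long. There are a lot of words in it.\r\n" +
  "Here's another line in the string that's also very long.";
console.log(JSON.stringify(wrapWithPlaceholder(sample, 20)));
```

Without the extra step, the result would end in a trailing "\r\n".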



Answer 4:


Since you don't indicate what you want to have happen if a single word is longer than the number of characters to wordwrap, I chose to split at the maximum number of characters (20 in this case) if a word is longer than 20:

resultString = Regex.Replace(subjectString, @"(.{1,19}\S)(?:\s+|$)|(.{20})", @"$1$2
", RegexOptions.Multiline);

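After the $1$2 there is a literal LF inside the verbatim string. To insert \r\n instead, use a regular (non-verbatim) string for the replacement. In a verbatim string such as @"$1$2\r\n" the backslashes stay literal characters, and .NET replacement patterns do not interpret escape sequences, which is why that form does not work:

resultString = Regex.Replace(subjectString, @"(.{1,19}\S)(?:\s+|$)|(.{20})", "$1$2\r\n", RegexOptions.Multiline);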



Answer 5:


Here's a solution that combines some of these good ideas. I wrote a regex from scratch and found it is very similar to the one provided by sln, but it's a little shorter and probably does less backtracking:

# assuming a max line length of 16
(?:
    [^\r\n]{1,16}(?=\s|$)       # non-linebreaking characters followed by a space 
                                #    or end-of-string, up to the max line length
    |[^\r\n]{16}                # Or for really long words: a sequence of non-breaking  
                                #    characters exactly the line length
    |(?<=\n)\r?\n               # Or blank lines: a line break following another line break.  This works for \n or \r\n styles.
)

Like L.B, I put the regex in an extension method, WordWrap:

void Main()
{
    var lineLen = 25;
    var test1 = "Some random words like calendar boat and breathe.\nAnd an extra line.\n\n\nAnd here's one that has to break in the middle because there are no spaces:\n"
        + String.Join("", Enumerable.Range(1, lineLen + 5).Select(i => (i % 10).ToString()));

    var test2 = test1.Replace("\n","\r\n");

    StringHelper.StringRuler(lineLen).Dump("ruler");
    String.Join("\n", test1.WordWrap(lineLen)).Dump("test 1");
    String.Join("\r\n", test2.WordWrap(lineLen)).Dump("test 2");
}

public static class StringHelper {

    public static IEnumerable<String> WordWrap(this string source, int lineLength) {
        return new Regex(
            @"(?:[^\r\n]{1,lineLength}(?=\s|$)|[^\r\n]{lineLength}|(?<=\n)\r?\n)"
                .Replace("lineLength", lineLength.ToString()))
            .Matches(source)
            .Cast<Match>()  // http://stackoverflow.com/a/7274451/555142
            .Select(m=>m.Value.Trim());
    }

    public static string StringRuler(int lineLength) {
        return 
            String.Join("", Enumerable.Range(1, lineLength)
                .Select(i => ((i % 10) == 0 ? (i / 10).ToString() : " "))) + "\n" 
            + String.Join("", Enumerable.Range(1, lineLength).Select(i => (i % 10).ToString())) + "\n" 
            + String.Join("", Enumerable.Range(1, lineLength).Select(i => "-")); 
    }

}

Tested with LINQPad. There are two tests: the first uses \n line breaks and the second uses \r\n line breaks.

ruler

         1         2     
1234567890123456789012345
------------------------- 

test 1

Some random words like
calendar boat and
breathe.
And an extra line.


And here's one that has
to break in the middle
because there are no
spaces:
1234567890123456789012345
67890 


test 2

Some random words like
calendar boat and
breathe.
And an extra line.


And here's one that has
to break in the middle
because there are no
spaces:
1234567890123456789012345
67890 
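
The same pattern carries over to JavaScript almost unchanged, since modern engines support the lookbehind it uses. The sketch below is an illustration, not part of the answer; wordWrapLines is a made-up name and the sample string is invented.

```javascript
function wordWrapLines(source, lineLength) {
  const n = String(lineLength);
  const pattern = new RegExp(
    "[^\\r\\n]{1," + n + "}(?=\\s|$)" + // words that fit, up to whitespace or the end
    "|[^\\r\\n]{" + n + "}" +           // a word longer than the line: cut at lineLength
    "|(?<=\\n)\\r?\\n",                 // a line break right after another one: blank line
    "g"
  );
  // match() with the g flag returns all matches, or null when there are none.
  return (source.match(pattern) || []).map((m) => m.trim());
}

const lines = wordWrapLines("one two three\n\nabcdefghijkl", 8);
console.log(lines); // [ 'one two', 'three', '', 'abcdefgh', 'ijkl' ]
```

As in the C# version, each blank line in the source comes back as an empty string, so joining the result with a line break preserves paragraph gaps.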



Answer 6:


My solution in JS:

function wordWrap(s, width) {
  var r = '(?:(.{1,' + width + '})[ \\r\\t]+|(.{' + width + '}))(?!$)';
  r = new RegExp(r, 'g');
  // console.log(r);
  return s.replace(r, '$1$2\n');
}
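
A brief usage sketch shows how this behaves. The function is repeated so the snippet runs on its own; the sample strings are made up, and they use no existing line breaks.

```javascript
function wordWrap(s, width) {
  var r = '(?:(.{1,' + width + '})[ \\r\\t]+|(.{' + width + '}))(?!$)';
  r = new RegExp(r, 'g');
  return s.replace(r, '$1$2\n');
}

// Breaks fall on spaces, and the spaces at the break are dropped.
console.log(JSON.stringify(wordWrap("This string is really long.", 10)));
// → "This\nstring is\nreally\nlong."

// A run with no spaces is cut every `width` characters.
console.log(JSON.stringify(wordWrap("abcdefghijklmno", 5)));
// → "abcde\nfghij\nklmno"
```

The (?!$) at the end of the pattern is what keeps a break from being added after the final chunk.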


Source: https://stackoverflow.com/questions/20431801/word-wrapping-with-regular-expressions
