NSRegularExpression separating paragraphs

陌路散爱 提交于 2020-01-05 12:02:49

问题


Consider this text:

Paragraph 1: Lorem ipsum dolor sit amet, consectetur adipisicing elit,
sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut
enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi
ut aliquip ex ea commodo consequat. 

Paragraph 2 Lorem ipsum dolor sit amet, consectetur adipisicing elit,
sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut
enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi
ut aliquip ex ea commodo consequat.







Paragraph 3 Lorem ipsum dolor sit amet, consectetur adipisicing elit,
sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut
enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi
ut aliquip ex ea commodo consequat.

In ObjC, when reading the above text, there are two \n\n line spaces between paragraph1 and paragraph2. But there are more than 3 line spaces \n\n\n\n between paragraph2 and paragraph3.

I wanted to have an NSRegularExpression pattern that would read and return those paragraphs completely disregarding the number of linespaces.

NSString *pattern = @"\n(*\n)\n";

NSRegularExpression* regex1 = [[NSRegularExpression alloc] initWithPattern:pattern options:NSRegularExpressionCaseInsensitive error:nil];

NSArray *array = [regex1 matchesInString:p options:0 range:NSMakeRange(0, [p length])];
for(NSTextCheckingResult *tcr in array){
    NSTextCheckingResult *tcr = [regex1 firstMatchInString:p options:0 range:NSMakeRange(0, p.length)];
    NSRange matchRange = [tcr rangeAtIndex:1];
    NSString *amatch = [p substringWithRange:matchRange];
    NSLog(@"Found string: %@", amatch);
}

I'm new to NSRegularExpression, any reference to a better tutorial would be great. In this case and, is this the right way to go about it in the above question.


回答1:


The following does the job. I also used the enumerateMatchesInString to find matches.

NSString *pattern = @"(\\A|\\n\\s*\\n)(.*?\\S[\\s\\S]*?\\S)(?=(\\Z|\\s*\\n\\s*\\n))";
NSRegularExpression* regex = [[NSRegularExpression alloc] initWithPattern:pattern
                                                                  options:NSRegularExpressionCaseInsensitive
                                                                    error:&error];

[regex enumerateMatchesInString:input
                        options:0
                          range:NSMakeRange(0, [input length])
                     usingBlock:^(NSTextCheckingResult *result, NSMatchingFlags flags, BOOL *stop) {
                         NSString *match = [input substringWithRange:[result rangeAtIndex:2]];
                         NSLog(@"match = '%@'", match);
                     }];

This returns not only the strings between two newline characters (ignoring any extra whitespace between the returns), but also the first one (i.e. between the beginning of the string and the first sequence of two newlines) and the last one (i.e. between the last sequence of two newlines and the end of the string.




回答2:


You don't need NSRegularExpression to do this. There are a load of really useful natural language parsing functions built right in to NSString.

The best way to do this is to enumerate the string like this...

NSString *string = @"Paragraph 1: Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.\n\n\nParagraph 2 Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.\n\n\n\n\n\n\n\n\n\nParagraph 3 Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.";

NSMutableArray *paragraphs = [NSMutableArray array];

[string enumerateSubstringsInRange:NSMakeRange(0, string.length) 
                           options:NSStringEnumerationByParagraphs 
                        usingBlock:^(NSString *substring, NSRange substringRange, NSRange enclosingRange, BOOL *stop) {
    [paragraphs addObject:substring];
}];

for (NSString *paragraph in paragraphs) {
    NSLog(@"%@", paragraph);
}

This will take each paragraph and put it into the paragraphs NSMutableArray.

This doesn't require any parsing or Regular expressions etc... It will also probably be quicker than anything you can write as it's a native function.




回答3:


I believe it may be done more easier with standard NSString methods:

NSArray *allParagraphs = [text componentsSeparatedByString:@"\n\n"];

NSCharacterSet *charactersToTrim = [NSCharacterSet whitespaceAndNewlineCharacterSet];
for (NSString *paragraph in allParagraphs) {
    NSString *trimmedParagraph = 
            [paragraph stringByTrimmingCharactersInSet:charactersToTrim];
}

Or, if you want to use regexp, try something like this:

"(.*?)(\\n{2,}|$)"

It keep all symbols until it find two or more new lines or end of file

Edit.

NSRegularExpression *regexp =
        [NSRegularExpression regularExpressionWithPattern:@"(.*?)(\\n{2,}|$)"
                                                  options:NSRegularExpressionDotMatchesLineSeparators
                                                    error:nil];
[regexp enumerateMatchesInString:TEST_STRING
                         options:0
                           range:NSMakeRange(0, TEST_STRING.length)
                      usingBlock:^(NSTextCheckingResult *result, NSMatchingFlags flags, BOOL *stop){
                          NSLog(@"%@", [TEST_STRING substringWithRange:[result rangeAtIndex:1]]);
                      }];



回答4:


I cannot help you with the NSRegularExpression matching and replacing, but I believe the regex you are looking for is \\n(\\n)+.

You need to escape the newline character twice. Once for the C string and once for the regex. The + character means one or more of the previous group.



来源:https://stackoverflow.com/questions/14569074/nsregularexpression-separating-paragraphs

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!