Question
I'm using an open-source method that converts HTML text into a plain-text NSString.
The resulting strings have large amounts of whitespace between the first couple of paragraphs, but only one line of space between subsequent paragraphs. An example of the output is in the EDIT below.

In stopCharacters and newLineAndWhitespaceCharacters, I removed \n from the character set because, when it was included, the entire text came out as one long paragraph.
- (NSString *)stringByConvertingHTMLToPlainText {
    // Pool
    NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init];
    // Character sets
    NSCharacterSet *stopCharacters = [NSCharacterSet characterSetWithCharactersInString:[NSString stringWithFormat:@"< \t\r%C%C%C%C", 0x0085, 0x000C, 0x2028, 0x2029]];
    NSCharacterSet *newLineAndWhitespaceCharacters = [NSCharacterSet characterSetWithCharactersInString:[NSString stringWithFormat:@" \t\r%C%C%C%C", 0x0085, 0x000C, 0x2028, 0x2029]];
    NSCharacterSet *tagNameCharacters = [NSCharacterSet characterSetWithCharactersInString:@"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"];
    // Scan and find all tags
    NSMutableString *result = [[NSMutableString alloc] initWithCapacity:self.length];
    NSScanner *scanner = [[NSScanner alloc] initWithString:self];
    [scanner setCharactersToBeSkipped:nil];
    [scanner setCaseSensitive:YES];
    NSString *str = nil, *tagName = nil;
    BOOL dontReplaceTagWithSpace = NO;
    do {
        // Scan up to the start of a tag or whitespace
        if ([scanner scanUpToCharactersFromSet:stopCharacters intoString:&str]) {
            [result appendString:str];
            str = nil; // reset
        }
        // Check if we've stopped at a tag/comment or whitespace
        if ([scanner scanString:@"<" intoString:NULL]) {
            // Stopped at a comment or tag
            if ([scanner scanString:@"!--" intoString:NULL]) {
                // Comment
                [scanner scanUpToString:@"-->" intoString:NULL];
                [scanner scanString:@"-->" intoString:NULL];
            } else {
                // Tag - remove and replace with space unless it's
                // a closing inline tag, then don't replace with a space
                if ([scanner scanString:@"/" intoString:NULL]) {
                    // Closing tag - replace with space unless it's inline
                    tagName = nil; dontReplaceTagWithSpace = NO;
                    if ([scanner scanCharactersFromSet:tagNameCharacters intoString:&tagName]) {
                        tagName = [tagName lowercaseString];
                        dontReplaceTagWithSpace = ([tagName isEqualToString:@"a"] ||
                                                   [tagName isEqualToString:@"b"] ||
                                                   [tagName isEqualToString:@"i"] ||
                                                   [tagName isEqualToString:@"q"] ||
                                                   [tagName isEqualToString:@"span"] ||
                                                   [tagName isEqualToString:@"em"] ||
                                                   [tagName isEqualToString:@"strong"] ||
                                                   [tagName isEqualToString:@"cite"] ||
                                                   [tagName isEqualToString:@"abbr"] ||
                                                   [tagName isEqualToString:@"acronym"] ||
                                                   [tagName isEqualToString:@"label"]);
                    }
                    // Replace tag with a space unless it was an inline tag
                    if (!dontReplaceTagWithSpace && result.length > 0 && ![scanner isAtEnd]) [result appendString:@" "];
                }
                // Scan past tag
                [scanner scanUpToString:@">" intoString:NULL];
                [scanner scanString:@">" intoString:NULL];
            }
        } else {
            // Stopped at whitespace - replace all whitespace and newlines with a space
            if ([scanner scanCharactersFromSet:newLineAndWhitespaceCharacters intoString:NULL]) {
                if (result.length > 0 && ![scanner isAtEnd]) [result appendString:@" "]; // Don't append space to beginning or end of result
            }
        }
    } while (![scanner isAtEnd]);
    // Cleanup
    [scanner release];
    // Decode HTML entities and return
    NSString *retString = [[result stringByDecodingHTMLEntities] retain];
    [result release];
    // Drain
    [pool drain];
    // Return
    return [retString autorelease];
}
EDIT:
Here is the NSLog of the string. I only pasted the first few paragraphs:
Mitt Romney spent the past six years running for president. After his loss to President Barack Obama, he'll have to chart a different course.
His initial plan: spend time with his family. He has five sons and 18 grandchildren, with a 19th on the way.
"I don't look at postelection to be a time of regrouping. Instead it's a time of forward focus," Romney told reporters aboard his plane Tuesday evening as he returned to Boston after the final campaign stop of his political career. "I have, of course, a family and life important to me, win or lose."
The most visible member of that family — wife Ann Romney — says neither she nor her husband will seek political office again.
etc....
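// Log the character codes of the 25 characters that follow the end of the first paragraph
NSUInteger start = [completeTrimmed rangeOfString:@"chart a different course."].location;
for (NSUInteger j = 25; j < 50; j++) {
    // characterAtIndex: returns a unichar (unsigned short), so store and print it as one
    unichar test = [completeTrimmed characterAtIndex:(start + j)];
    NSLog(@"%hu", test);
}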
2012-11-11 17:15:57.668 LMU_LAL_LAUNCHER[5431:c07] 32
2012-11-11 17:15:57.669 LMU_LAL_LAUNCHER[5431:c07] 32
2012-11-11 17:15:57.669 LMU_LAL_LAUNCHER[5431:c07] 10
2012-11-11 17:15:57.670 LMU_LAL_LAUNCHER[5431:c07] 32
2012-11-11 17:15:57.670 LMU_LAL_LAUNCHER[5431:c07] 32
2012-11-11 17:15:57.670 LMU_LAL_LAUNCHER[5431:c07] 10
2012-11-11 17:15:57.671 LMU_LAL_LAUNCHER[5431:c07] 32
2012-11-11 17:15:57.671 LMU_LAL_LAUNCHER[5431:c07] 32
2012-11-11 17:15:57.671 LMU_LAL_LAUNCHER[5431:c07] 10
2012-11-11 17:15:57.672 LMU_LAL_LAUNCHER[5431:c07] 32
2012-11-11 17:15:57.672 LMU_LAL_LAUNCHER[5431:c07] 72
2012-11-11 17:15:57.672 LMU_LAL_LAUNCHER[5431:c07] 105
2012-11-11 17:15:57.673 LMU_LAL_LAUNCHER[5431:c07] 115
2012-11-11 17:15:57.673 LMU_LAL_LAUNCHER[5431:c07] 32
2012-11-11 17:15:57.673 LMU_LAL_LAUNCHER[5431:c07] 105
2012-11-11 17:15:57.673 LMU_LAL_LAUNCHER[5431:c07] 110
2012-11-11 17:15:57.674 LMU_LAL_LAUNCHER[5431:c07] 105
2012-11-11 17:15:57.674 LMU_LAL_LAUNCHER[5431:c07] 116
2012-11-11 17:15:57.674 LMU_LAL_LAUNCHER[5431:c07] 105
2012-11-11 17:15:57.675 LMU_LAL_LAUNCHER[5431:c07] 97
2012-11-11 17:15:57.675 LMU_LAL_LAUNCHER[5431:c07] 108
2012-11-11 17:15:57.675 LMU_LAL_LAUNCHER[5431:c07] 32
2012-11-11 17:15:57.675 LMU_LAL_LAUNCHER[5431:c07] 112
2012-11-11 17:15:57.676 LMU_LAL_LAUNCHER[5431:c07] 108
2012-11-11 17:15:57.676 LMU_LAL_LAUNCHER[5431:c07] 97
Answer 1:
I tried the code from the question, and this is how I fixed it:
NSString *decoded = [result stringByDecodingHTMLEntities];
[result release];
decoded = [decoded stripDuplicateCharactersInSet:[NSCharacterSet whitespaceCharacterSet] withString:@" "];
decoded = [decoded stripDuplicateCharactersInSet:[NSCharacterSet newlineCharacterSet] withString:@"\n"];
// Retain only the final string so that it survives [pool drain] and nothing is leaked
NSString *retString = [decoded retain];
I defined a category method on NSString, declared as:
- (NSString *)stripDuplicateCharactersInSet:(NSCharacterSet *)characterSet withString:(NSString *)joiningString;
The implementation is as follows:
- (NSString *)stripDuplicateCharactersInSet:(NSCharacterSet *)characterSet withString:(NSString *)joiningString {
    NSMutableString *originalStr = [NSMutableString string];
    if (!self) {
        return nil;
    }
    NSArray *componentsArray = [self componentsSeparatedByCharactersInSet:characterSet];
    int counter = 0;
    for (NSString *stringComponent in componentsArray) {
        counter++;
        if ((stringComponent) && ([stringComponent length] > 0) && (![stringComponent isEqualToString:@" "]) && ((![stringComponent isEqualToString:@"\n"]) || (![joiningString isEqualToString:@"\n"]))) {
            if ([componentsArray count] == counter) {
                [originalStr appendFormat:@"%@", stringComponent];
            } else {
                [originalStr appendFormat:@"%@%@", stringComponent, joiningString];
            }
        }
    }
    return originalStr;
}
Add the above method to the NSString+HTML.m file as a category on NSString. In the HTML you gave, whitespace and newlines were interleaved repeatedly, so stripping newlines alone did not work. Instead, I remove duplicate newlines and whitespace as shown above: the string is split on the character set, and each component is appended to the result only if it is not empty and is not itself just a space or a newline.
Alternatively, you can also try:
NSString *decoded = [result stringByDecodingHTMLEntities];
[result release];
// Retain only the final string so that it survives [pool drain] and nothing is leaked
NSString *retString = [[decoded stripDuplicateNewlineCharacters] retain];
The method is defined as:
- (NSString *)stripDuplicateNewlineCharacters {
    NSMutableString *originalStr = [NSMutableString string];
    if (!self) {
        return nil;
    }
    NSArray *componentsArray = [self componentsSeparatedByCharactersInSet:[NSCharacterSet newlineCharacterSet]];
    NSUInteger counter = 0;
    for (NSString *line in componentsArray) {
        counter++;
        // Collapse every run of spaces to a single space. A single replacement pass
        // only halves a run (matches do not overlap), so repeat until none is left.
        NSString *stringComponent = line;
        while ([stringComponent rangeOfString:@"  "].location != NSNotFound) {
            stringComponent = [stringComponent stringByReplacingOccurrencesOfString:@"  " withString:@" "];
        }
        // Skip components that are empty or contain only a single space
        if ((stringComponent) && ([stringComponent length] > 0) && (![stringComponent isEqualToString:@" "]) && (![stringComponent isEqualToString:@"\n"])) {
            if ([componentsArray count] == counter) {
                [originalStr appendFormat:@"%@", stringComponent];
            } else {
                [originalStr appendFormat:@"%@\n", stringComponent];
            }
        }
    }
    return originalStr;
}
In this case, duplicate spaces are removed inside the same method that removes the duplicate newline characters.
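As a rough alternative to the two category methods, the same clean-up could probably be done with NSRegularExpression. The sketch below is only an illustration under assumptions: CollapseMixedWhitespace is a hypothetical helper name, not part of NSString+HTML, and the patterns assume that the only characters involved are spaces, tabs, and newlines, as in the log from the question.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not part of NSString+HTML): any stretch of whitespace
// that contains at least one newline becomes a single newline, and any
// remaining run of spaces or tabs becomes a single space.
static NSString *CollapseMixedWhitespace(NSString *input) {
    if (input == nil) return nil;

    // Optional spaces/tabs, a newline, then any further whitespace (including more newlines)
    NSRegularExpression *breaks =
        [NSRegularExpression regularExpressionWithPattern:@"[ \\t]*\\n\\s*" options:0 error:NULL];
    NSString *output = [breaks stringByReplacingMatchesInString:input
                                                        options:0
                                                          range:NSMakeRange(0, input.length)
                                                   withTemplate:@"\n"];

    // Two or more spaces/tabs inside a line
    NSRegularExpression *spaces =
        [NSRegularExpression regularExpressionWithPattern:@"[ \\t]{2,}" options:0 error:NULL];
    return [spaces stringByReplacingMatchesInString:output
                                            options:0
                                              range:NSMakeRange(0, output.length)
                                       withTemplate:@" "];
}
```

It could be called on the decoded string in place of the stripDuplicate... methods. Whether a single newline or a blank line between paragraphs is wanted is a choice for the caller; the template string in the first replacement controls that.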
Answer 2:
Try this:
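// Decode HTML entities and trim leading/trailing whitespace and newlines.
// Retain before draining: the trimmed string is autoreleased into the pool
// and would otherwise be deallocated by [pool drain].
NSString *retString = [[[result stringByDecodingHTMLEntities] stringByTrimmingCharactersInSet:[NSCharacterSet whitespaceAndNewlineCharacterSet]] retain];
[result release];
// Drain
[pool drain];
// Return
return [retString autorelease];
}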
If the above does not work, also try
completeTrimmed = [completeTrimmed stringByReplacingOccurrencesOfString:@"\n" withString:@""];
and
completeTrimmed = [completeTrimmed stringByReplacingOccurrencesOfString:@"\r" withString:@""];
Answer 3:
You could replace @"\n\n" with @"\n" to reduce the number of line breaks.
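A minimal sketch of that replacement, assuming a hypothetical helper named CollapseDoubledNewlines (the name is made up for illustration). Because stringByReplacingOccurrencesOfString:withString: does not re-scan its own output, one pass only halves a run of newlines, so the sketch repeats until no doubled newline remains.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not part of the parsing class): reduces every run of
// consecutive newlines to a single newline.
static NSString *CollapseDoubledNewlines(NSString *input) {
    // Guard against nil: rangeOfString: sent to nil returns location 0,
    // which would never equal NSNotFound and would loop forever.
    if (input == nil) return nil;

    NSString *output = input;
    while ([output rangeOfString:@"\n\n"].location != NSNotFound) {
        output = [output stringByReplacingOccurrencesOfString:@"\n\n" withString:@"\n"];
    }
    return output;
}
```

Note that in the log from the question the newlines (10) are separated by spaces (32), so the spaces between them would have to be removed first for this replacement to find anything.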
Source: https://stackoverflow.com/questions/13283172/open-source-html-parsing-class-not-properly-parsing-spaces-between-paragraphs