NSAttributedString initWithHTML incorrect character encoding?

Submitted by 夏天 on 2019-12-04 16:02:06

Question


-[NSMutableAttributedString initWithHTML:documentAttributes:] seems to mangle special characters:

NSString *html = @"“Hello” World"; // notice the smart quotes
NSData *htmlData = [html dataUsingEncoding:NSUTF8StringEncoding];
NSMutableAttributedString *as = [[NSMutableAttributedString alloc] initWithHTML:htmlData documentAttributes:nil];
NSLog(@"%@", as);

That prints “Hello†World followed by some RTF commands. In my application, I convert the attributed string to RTF and display it in an NSTextView, but the characters are corrupted there, too.

According to the documentation, the default encoding is UTF-8, but I tried being explicit and the result is the same:

NSDictionary *attributes = @{NSCharacterEncodingDocumentAttribute: [NSNumber numberWithInt:NSUTF8StringEncoding]};
NSMutableAttributedString *as = [[NSMutableAttributedString alloc] initWithHTML:htmlData documentAttributes:&attributes];

Answer 1:


Use [html dataUsingEncoding:NSUnicodeStringEncoding] when creating the NSData, and set the matching encoding option when you parse the HTML into an attributed string.

The documentation for NSCharacterEncodingDocumentAttribute is slightly confusing:

NSNumber, containing an int specifying the NSStringEncoding for the file; for reading and writing plain text files and writing HTML; default for plain text is the default encoding; default for HTML is UTF-8.

So, your code should be:

NSString *html = @"“Hello” World";
NSData *htmlData = [html dataUsingEncoding:NSUTF8StringEncoding];
NSDictionary *options = @{NSDocumentTypeDocumentAttribute: NSHTMLTextDocumentType,
                          NSCharacterEncodingDocumentAttribute: @(NSUTF8StringEncoding)};
NSMutableAttributedString *as =
    [[NSMutableAttributedString alloc] initWithHTML:htmlData
                                            options:options
                                 documentAttributes:nil];



Answer 2:


The previous answer here works, but mostly by accident.

Making an NSData with NSUnicodeStringEncoding will tend to work, because that constant is an alias for NSUTF16StringEncoding, and UTF-16 is pretty easy for the system to identify. It is easier to identify than UTF-8, which apparently was being detected as some other superset of ASCII (it looks like NSWindowsCP1252StringEncoding in your case, probably because it's one of the few ASCII-based encodings with mappings for bytes in the 0x80 to 0x9F range).
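The CP1252 guess can be checked by hand. As a rough sketch (this is not what the HTML importer does internally, just the same bytes decoded two ways), the following encodes the string as UTF-8 and then deliberately misreads those bytes as Windows-1252. The helper name DecodeBytes is made up for this illustration, and how Foundation treats the 0x9D byte, which CP1252 leaves undefined, is something to verify on your own system.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper for this illustration: decode raw bytes with a given encoding.
static NSString *DecodeBytes(NSData *bytes, NSStringEncoding encoding) {
    return [[NSString alloc] initWithData:bytes encoding:encoding];
}

int main(void) {
    @autoreleasepool {
        NSString *html = @"“Hello” World";
        NSData *utf8 = [html dataUsingEncoding:NSUTF8StringEncoding];

        // U+201C encodes as E2 80 9C and U+201D as E2 80 9D in UTF-8.
        NSLog(@"UTF-8 bytes: %@", utf8);

        // Correct interpretation: round-trips to the original string.
        NSLog(@"as UTF-8:  %@", DecodeBytes(utf8, NSUTF8StringEncoding));

        // Wrong interpretation: each byte becomes its own character
        // (E2 -> â, 80 -> €, 9C -> œ), matching the mangled output in the question.
        NSLog(@"as CP1252: %@", DecodeBytes(utf8, NSWindowsCP1252StringEncoding));
    }
    return 0;
}
```

This should build as a command-line tool with something like clang -fobjc-arc -framework Foundation demo.m (the file name is arbitrary).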

That answer is mistaken in quoting the documentation for NSCharacterEncodingDocumentAttribute, because "attributes" are what you get out of -initWithHTML. That's why it's NSDictionary ** and not just NSDictionary *. You can pass in a pointer to an NSDictionary *, and you'll get out keys like TopMargin/BottomMargin/LeftMargin/RightMargin, PaperSize, DocumentType, UTI, etc. Any values you try to pass in through the "attributes" dictionary are ignored.
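To make the in/out distinction concrete, here is a small sketch (macOS only, since -initWithHTML:options:documentAttributes: comes from AppKit) that passes the encoding in through "options" and reads the document attributes back out. The helper name ParseHTML is made up for the illustration, and the exact keys you get back may vary by OS version.

```objc
#import <Cocoa/Cocoa.h>

// Hypothetical helper: parse UTF-8 HTML, returning the text and (by reference)
// whatever document attributes the importer reports.
static NSAttributedString *ParseHTML(NSString *html, NSDictionary **outAttributes) {
    NSData *data = [html dataUsingEncoding:NSUTF8StringEncoding];
    // Input goes in through "options"...
    NSDictionary *options = @{NSTextEncodingNameDocumentOption: @"UTF-8"};
    // ...while "documentAttributes" is filled in by the initializer.
    return [[NSAttributedString alloc] initWithHTML:data
                                            options:options
                                 documentAttributes:outAttributes];
}

int main(void) {
    @autoreleasepool {
        NSDictionary *attributes = nil;  // starts empty; nothing is read from it
        NSAttributedString *as = ParseHTML(@"“Hello” World", &attributes);
        NSLog(@"text: %@", as.string);
        NSLog(@"document attributes: %@", attributes);
    }
    return 0;
}
```

Logging the dictionary is a quick way to see which keys the importer actually reports on your system, rather than relying on a list from memory.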

You need to use "options" for passing values in, and the relevant option key is NSTextEncodingNameDocumentOption, which has no documented default value. It's passing the bytes to WebKit for parsing, so if you don't specify an encoding, presumably you're getting WebKit's encoding-guessing heuristics.

To guarantee the encoding types match between your NSData and NSAttributedString, what you should do is something like:

NSString *html = @"“Hello” World";
NSData *htmlData = [html dataUsingEncoding:NSUTF8StringEncoding];

NSMutableAttributedString *as =
    [[NSMutableAttributedString alloc] initWithHTML:htmlData
                                            options:@{NSTextEncodingNameDocumentOption: @"UTF-8"}
                                 documentAttributes:nil];


Source: https://stackoverflow.com/questions/15956698/nsattributedstring-initwithhtml-incorrect-character-encoding
