Swift string indexing combines “\r\n” as one char instead of two

Submitted by 假装没事ソ on 2021-02-10 03:02:55

Question


I am working with strings that contain \r\n in Swift 4.2, and I ran into some strange behavior in Swift's string indexing: \r\n appears to be treated as one character instead of two. I wrote a piece of code to demonstrate this behavior:

var text = "ABC\r\n\r\nDEF"

func printChar(_ lower: Int, _ upper: Int) {
    let start = text.index(text.startIndex, offsetBy: lower)
    let end = text.index(text.startIndex, offsetBy: upper)
    print("\"" + text[start..<end] + "\"")
}

printChar(0, 1) // "A"
printChar(1, 2) // "B"
printChar(2, 3) // "C"
printChar(3, 4) // new line
printChar(4, 5) // new line (okay, what's going on here?)
printChar(5, 6) // "D"
printChar(6, 7) // "E"
printChar(7, 8) // "F"

The print result will be

"A"
"B"
"C"
"
"
"
"
"D"
"E"
"F"

Any idea why it's like this?


Answer 1:


TL;DR: \r\n is a single grapheme cluster in Unicode, so Swift treats it as a single Character.


  • Swift treats \r\n as one Character.

  • Objective-C NSString treats it as two characters (as reported by its length property).
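
The difference between the two bullets above can be checked with a minimal sketch. It uses the utf16 view as a stand-in for NSString's length (both count UTF-16 code units), so no Foundation import is needed; the variable names are only illustrative.

```swift
// "\r\n" is one Character (grapheme cluster) but two UTF-16 code units.
let crlf = "\r\n"

let characterCount = crlf.count                // counts Characters
let utf16Count = crlf.utf16.count              // counts UTF-16 code units, as NSString's length does
let scalarCount = crlf.unicodeScalars.count    // counts Unicode scalars

print(characterCount, utf16Count, scalarCount) // prints "1 2 2"
```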

On the swift-users forum someone wrote:

– "\r\n" is a single Character. Is this the correct behaviour?

– Yes, a Character corresponds to a Unicode grapheme cluster, and "\r\n" is considered a single grapheme cluster.

The subsequent response linked to the Unicode documentation on text segmentation (Unicode Standard Annex #29). Its grapheme cluster table officially states that CRLF is a single grapheme cluster.

Take a look at the Apple documentation on Characters and Grapheme Clusters.

It's common to think of a string as a sequence of characters, but when working with NSString objects, or with Unicode strings in general, in most cases it is better to deal with substrings rather than with individual characters. The reason for this is that what the user perceives as a character in text may in many cases be represented by multiple characters in the string.

The Swift documentation on Strings and Characters is also worth reading.

This overview from objc.io is interesting as well.

NSString represents UTF-16-encoded text. Length, indices, and ranges are all based on UTF-16 code units.

Another example of this is an emoji like 👍🏻. This single character is actually made of two Unicode scalars, U+1F44D (thumbs up) and U+1F3FB (skin tone modifier), which UTF-16 encodes as four code units: %uD83D%uDC4D%uD83C%uDFFB. But if you called count on a string with just that emoji you'd (correctly) get 1.
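
As a rough illustration of the emoji case, the sketch below compares the three views of the same string. The emoji is written with Unicode escapes so the source stays ASCII; the variable names are made up for this example, and the Character count assumes a reasonably recent Swift toolchain.

```swift
// Thumbs up (U+1F44D) followed by a skin tone modifier (U+1F3FB).
let thumbsUp = "\u{1F44D}\u{1F3FB}"

let characterCount = thumbsUp.count              // grapheme clusters
let scalarCount = thumbsUp.unicodeScalars.count  // Unicode scalars
let utf16Count = thumbsUp.utf16.count            // UTF-16 code units

print(characterCount, scalarCount, utf16Count)   // prints "1 2 4"

// Print each scalar in hexadecimal.
for scalar in thumbsUp.unicodeScalars {
    print(String(scalar.value, radix: 16, uppercase: true))  // 1F44D, then 1F3FB
}
```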

If you wanted to see the scalars you could iterate them as follows:

for scalar in text.unicodeScalars {
    print("\(scalar.value) ", terminator: "")
}

For "\r\n" this prints 13 10, that is, a carriage return followed by a line feed.

In the Swift documentation you'll find why NSString is different:

The count of the characters returned by the count property isn’t always the same as the length property of an NSString that contains the same characters. The length of an NSString is based on the number of 16-bit code units within the string’s UTF-16 representation and not the number of Unicode extended grapheme clusters within the string.

Thus this isn't really "strange" behaviour of Swift string indexing, but rather a result of how Unicode treats these characters and how String in Swift is designed. Swift string indexing goes by Character and \r\n is a single Character.
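
If you do need to address \r and \n separately, one option is to index into the unicodeScalars view instead of the String itself. The following is only a sketch under that assumption: scalarSlice is a hypothetical helper, not something from the original post, and it traps if the offsets are out of range.

```swift
let text = "ABC\r\n\r\nDEF"

// Hypothetical helper: slices by Unicode scalar offsets
// rather than by Character offsets.
func scalarSlice(_ string: String, _ lower: Int, _ upper: Int) -> String {
    let scalars = string.unicodeScalars
    let start = scalars.index(scalars.startIndex, offsetBy: lower)
    let end = scalars.index(scalars.startIndex, offsetBy: upper)
    // Copy the selected scalars into a new view and build a String from it.
    var slice = String.UnicodeScalarView()
    slice.append(contentsOf: scalars[start..<end])
    return String(slice)
}

print(text.count)                        // 8 Characters
print(text.unicodeScalars.count)         // 10 Unicode scalars
print(scalarSlice(text, 3, 4) == "\r")   // true
print(scalarSlice(text, 4, 5) == "\n")   // true
```

Here offsets 3 and 4 address the carriage return and the line feed individually, whereas Character-based indexing sees the pair as one element.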



Source: https://stackoverflow.com/questions/53940147/swift-string-indexing-combines-r-n-as-one-char-instead-of-two
