Is there any reasonable way to access the contents of a CharacterSet?

Submitted by 老子叫甜甜 on 2019-12-11 01:35:15

Question


For a random string generator, I thought it would be nice to use CharacterSet as input type for the alphabet to use, since the pre-defined sets such as CharacterSet.lowercaseLetters are obviously useful (even if they may contain more diverse character sets than you'd expect).

However, apparently you can only query character sets for membership; you cannot enumerate them, let alone index them. All we get is _.bitmapRepresentation, an 8 KB chunk of data with an indicator bit for every (?) character. But even if you peel out individual bits by index i (which is less than nice, going through byte-oriented Data), Character(UnicodeScalar(i)) does not give the correct letter. That means the format is somewhat obscure and, as far as I can tell, not documented.

Of course we can iterate over all characters (per plane) but that is a bad idea, cost-wise: a 20-character set may require iterating over tens of thousands of characters. Speaking in CS terms: bit-vectors are a (very) bad implementation for sparse sets. Why they chose to make the trade-off in this way here, I have no idea.

Am I missing something here, or is CharacterSet just another dead end in the Foundation API?


Answer 1:


By your definition, no, there is no "reasonable" way. That's just how NSCharacterSet stores it. It's optimized for testing membership, not enumerating all members.

Your loop can increment a counter over the codepoints, or it can shift the bits (one per codepoint), but either way you have to loop and test. The highest "Ll" character on my Mac is U+1D7CB (#120,779), so if you want to compute this list of characters at runtime, your code will have to loop at least that many times. See the Objective-C version of the documentation for details on how the bit vector is organized.
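
The loop-and-test approach described above can be sketched roughly as follows. This is a minimal illustration, not code from the answer: the function name scalars(in:) is hypothetical, and it uses hasMember(inPlane:) to skip planes that contain no members before testing each code point with contains(_:).

```swift
import Foundation

// Hypothetical helper (not a Foundation API): enumerate a CharacterSet
// by testing every candidate code point for membership.
func scalars(in set: CharacterSet) -> [Unicode.Scalar] {
    var result: [Unicode.Scalar] = []
    for plane in UInt8(0)...16 {
        // Skip whole planes that have no members at all.
        guard set.hasMember(inPlane: plane) else { continue }
        let base = UInt32(plane) << 16
        for offset in UInt32(0)...0xFFFF {
            // Unicode.Scalar(_:) returns nil for surrogate code points.
            if let scalar = Unicode.Scalar(base + offset), set.contains(scalar) {
                result.append(scalar)
            }
        }
    }
    return result
}

let found = scalars(in: CharacterSet(charactersIn: "cab𝚨"))
print(found.map { String($0.value, radix: 16, uppercase: true) })
```

Even with the plane check, every populated plane still costs 65,536 membership tests, which is the cost the question objects to.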

The good news is that this is fast. With unoptimized code on my 10-year-old Mac, it takes less than 1/10th of a second to find all 1,841 lowercaseLetters. If that's still not fast enough, it's easy to hide the cost by doing it once, in the background, at startup time.
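
One way to do the scan only once is to keep the result in a static constant and touch it from a background queue at startup. The sketch below is an assumption about how that could look; the AlphabetCache type and its members are hypothetical names, not part of Foundation.

```swift
import Foundation

// Hypothetical cache type; the name and shape are illustrative only.
enum AlphabetCache {
    // A static let is initialized lazily, exactly once, in a thread-safe way.
    static let lowercaseLetters: [Unicode.Scalar] =
        AlphabetCache.scan(CharacterSet.lowercaseLetters)

    // Brute-force scan over every possible Unicode scalar value.
    private static func scan(_ set: CharacterSet) -> [Unicode.Scalar] {
        var found: [Unicode.Scalar] = []
        for value in UInt32(0)...0x10FFFF {
            if let scalar = Unicode.Scalar(value), set.contains(scalar) {
                found.append(scalar)
            }
        }
        return found
    }
}

// Warm the cache off the main thread; later reads reuse the stored array.
DispatchQueue.global(qos: .utility).async {
    _ = AlphabetCache.lowercaseLetters
}
```

If another thread reads the property before the warm-up finishes, it simply waits for the one-time initialization instead of repeating the scan.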




Answer 2:


Following the documentation, here is an improvement on Satachito's answer that supports non-contiguous planes by actually taking the plane index into account:

import Foundation

extension CharacterSet {
    func codePoints() -> [Int] {
        var result: [Int] = []
        var plane = 0
        // following documentation at https://developer.apple.com/documentation/foundation/nscharacterset/1417719-bitmaprepresentation
        // Layout: 8192 bytes for plane 0, then for each further plane
        // one plane-index byte followed by 8192 bytes of bitmap.
        for (i, w) in bitmapRepresentation.enumerated() {
            let k = i % 8193
            if k == 8192 {
                // plane index byte: convert it to a byte offset (plane * 8192)
                plane = Int(w) << 13
                continue
            }
            // code point of bit 0 of this byte
            let base = (plane + k) << 3
            for j in 0 ..< 8 where w & (1 << j) != 0 {
                result.append(base + j)
            }
        }
        return result
    }

    func printHexValues() {
        codePoints().forEach { print(String(format: "%02X", $0)) }
    }
}

Usage

print("whitespaces:")
CharacterSet.whitespaces.printHexValues()
print()
print("two characters from different planes:")
CharacterSet(charactersIn: "𝚨󌞑").printHexValues()

Results

whitespaces:
09
20
A0
1680
2000
2001
2002
2003
2004
2005
2006
2007
2008
2009
200A
200B
202F
205F
3000

two characters from different planes:
1D6A8
CC791
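
To connect this back to the random string generator from the question, a list of code points like the one produced by codePoints() can be turned into an alphabet of Character values. The following is only a sketch under that assumption; randomString(length:from:) is a hypothetical helper, not something from the answer or from Foundation.

```swift
import Foundation

// Hypothetical helper: build a random string from a list of code points,
// for example the output of a codePoints()-style enumeration.
func randomString(length: Int, from codePoints: [Int]) -> String {
    // Drop values that are not valid Unicode scalars
    // (negative numbers, surrogates, anything above U+10FFFF).
    let alphabet: [Character] = codePoints.compactMap { (cp: Int) -> Character? in
        guard let value = UInt32(exactly: cp),
              let scalar = Unicode.Scalar(value) else { return nil }
        return Character(scalar)
    }
    guard !alphabet.isEmpty, length > 0 else { return "" }
    // Pick each position independently and uniformly from the alphabet.
    return String((0..<length).map { _ in alphabet.randomElement()! })
}

let sample = randomString(length: 8, from: [0x61, 0x62, 0x63, 0x1D6A8])
print(sample)
```

Note that sets containing combining marks may produce strings whose visible character count differs from the requested length, since adjacent scalars can merge into one grapheme.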

Performance

This is effectively 3 to 10 times faster than iterating over all characters. The comparison was made against the earlier answers to the question "NSArray from NSCharacterset".




Answer 3:


bitmapRepresentation has been documented.

https://developer.apple.com/documentation/foundation/nscharacterset/1417719-bitmaprepresentation

So you can iterate over that Data as follows:

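import Foundation

var offset = 0  // number of plane-index bytes skipped so far
for (i, w) in CharacterSet.whitespaces.bitmapRepresentation.enumerated() {
    if i % 8193 == 8192 {
        // Plane-index byte: skipped without reading its value, so this
        // only works when the planes present are contiguous (0, 1, 2, ...).
        offset += 1
        continue
    }
    let byteIndex = i - offset
    for j in 0 ..< 8 where w & (1 << j) != 0 {
        print(String(format: "%02X", byteIndex * 8 + j))
    }
}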

Result:

09
20
A0
1680
2000
2001
2002
2003
2004
2005
2006
2007
2008
2009
200A
200B
202F
205F
3000


Source: https://stackoverflow.com/questions/43322441/is-there-any-reasonable-way-to-access-the-contents-of-a-characterset
