get all unicode variations of a latin character

问题

E.g., for the character "a", I want to get a string (list of chars) like "aàáâãäåāăą" (not sure if that example list is complete...) (basically all unicode chars with names "Latin Small Letter A with *").

Is there a generic way to get this?

I'm asking for Python, but if the answer is more generic, this is also fine, although I would appreciate a Python code snippet in any case. Python >=3.5 is fine. But I guess you need to have access to the Unicode database, e.g. the Python module unicodedata, which I would prefer over other external data sources.

I could imagine some solution like this:

def get_variations(char):
   import unicodedata
   name = unicodedata.name(char)
   chars = char
   for variation in ["WITH CEDILLA", "WITH MACRON", ...]:
      try: 
          chars += unicodedata.lookup("%s %s" % (name, variation))
      except KeyError:
          pass
   return chars

回答1:

To start, get a collection of the Unicode combining diacritical characters; they're contiguous, so this is pretty easy, e.g.:

# Unicode combining diacritical marks run from 768 to 879, inclusive
combining_chars = ''.join(map(chr, range(768, 880)))

Now define a function that attempts to compose each one with a base ASCII character; when the composed normal form is length 1 (meaning the ASCII + combining became a single Unicode ordinal), save it:

import unicodedata

def get_unicode_variations(letter):
    if len(letter) != 1:
        raise ValueError("letter must be a single character to check for variations")
    variations = []
    # We could just loop over map(chr, range(768, 880)) without caching
    # in combining_chars, but that increases runtime ~20%
    for combiner in combining_chars:
        normalized = unicodedata.normalize('NFKC', letter + combiner)
        if len(normalized) == 1:
            variations.append(normalized)
    return ''.join(variations)

This has the advantage of not trying to manually perform string lookups in the unicodedata DB, and not needing to hardcode all possible descriptions of the combining characters. Anything that composes to a single character gets included; runtime for the check on my machine comes in under 50 µs, so if you're not doing this too often, the cost is reasonable (you could decorate with functools.lru_cache if you intend to call it repeatedly with the same arguments and want to avoid recomputing it every time).

If you want to get everything built out of one of these characters, a more exhaustive search can find it, but it'll take longer (functools.lru_cache would be nigh mandatory unless it's only ever called once per argument):

import functools
import sys
import unicodedata

@functools.lru_cache(maxsize=None)
def get_unicode_variations_exhaustive(letter): 
    if len(letter) != 1:
        raise ValueError("letter must be a single character to check for variations")
    variations = [] 
    for testlet in map(chr, range(sys.maxunicode)): 
        if letter in unicodedata.normalize('NFKD', testlet) and testlet != letter: 
            variations.append(testlet) 
    return ''.join(variations)

This looks for any character that decomposes into a form that includes the target letter; it does mean that searching the first time takes roughly a third of a second, and the result includes stuff that isn't really just a modified version of the character (e.g. 'L''s result will include ℡, which isn't really a "modified 'L'), but it's as exhaustive as you can get.

回答2:

There is none that I know of, however you could build one yourself. Just look up the start and end numbers of your special characters. You can do so using unicode character table. And then for each character create a list using these numbers:

ranges = {
  'A': (192, 199),
  'B': (0, 0),
  'E': (200, 204),
  ...
}

map = {}
for char, rng in ranges.items():
  start, end = rng 
  map[char] = char + ''.join([chr(i) for i in range(start, end)])

This would generate a map such that:

{
  'A': 'AÀÁÂÃÄÅÆ'
  'B': 'B',
  'E': 'EÈÉÊË',
  ...
}

回答3:

You can use the decomposition mappings of the Unicode database directly. The following code checks all mappings for characters with a decomposition starting with a certain letter:

def get_unicode_variations(letter):
    letter_code = ord(letter)
    # For some characters, you might want to check all
    # code points up to 0x10FFFF
    for i in range(65536):
        decomp = unicodedata.decomposition(chr(i))
        # Mappings starting with '<...>' indicate a
        # compatibility mapping (NFKD, NFKC) which we ignore.
        while decomp != '' and not decomp.startswith('<'):
            first_code = int(decomp.split()[0], 16)
            if first_code == letter_code:
                print(chr(i), unicodedata.name(chr(i)))
                break
            # Try to decompose further
            decomp = unicodedata.decomposition(chr(first_code))

This is rather inefficient if you want to process multiple characters, though. For the letter a, the code above prints:

à LATIN SMALL LETTER A WITH GRAVE
á LATIN SMALL LETTER A WITH ACUTE
â LATIN SMALL LETTER A WITH CIRCUMFLEX
ã LATIN SMALL LETTER A WITH TILDE
ä LATIN SMALL LETTER A WITH DIAERESIS
å LATIN SMALL LETTER A WITH RING ABOVE
ā LATIN SMALL LETTER A WITH MACRON
ă LATIN SMALL LETTER A WITH BREVE
ą LATIN SMALL LETTER A WITH OGONEK
ǎ LATIN SMALL LETTER A WITH CARON
ǟ LATIN SMALL LETTER A WITH DIAERESIS AND MACRON
ǡ LATIN SMALL LETTER A WITH DOT ABOVE AND MACRON
ǻ LATIN SMALL LETTER A WITH RING ABOVE AND ACUTE
ȁ LATIN SMALL LETTER A WITH DOUBLE GRAVE
ȃ LATIN SMALL LETTER A WITH INVERTED BREVE
ȧ LATIN SMALL LETTER A WITH DOT ABOVE
ḁ LATIN SMALL LETTER A WITH RING BELOW
ạ LATIN SMALL LETTER A WITH DOT BELOW
ả LATIN SMALL LETTER A WITH HOOK ABOVE
ấ LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE
ầ LATIN SMALL LETTER A WITH CIRCUMFLEX AND GRAVE
ẩ LATIN SMALL LETTER A WITH CIRCUMFLEX AND HOOK ABOVE
ẫ LATIN SMALL LETTER A WITH CIRCUMFLEX AND TILDE
ậ LATIN SMALL LETTER A WITH CIRCUMFLEX AND DOT BELOW
ắ LATIN SMALL LETTER A WITH BREVE AND ACUTE
ằ LATIN SMALL LETTER A WITH BREVE AND GRAVE
ẳ LATIN SMALL LETTER A WITH BREVE AND HOOK ABOVE
ẵ LATIN SMALL LETTER A WITH BREVE AND TILDE
ặ LATIN SMALL LETTER A WITH BREVE AND DOT BELOW

回答4:

With unichars:

› unichars -a | grep -i 'Latin Small Letter A with'
 à  U+000E0 LATIN SMALL LETTER A WITH GRAVE
 á  U+000E1 LATIN SMALL LETTER A WITH ACUTE
 â  U+000E2 LATIN SMALL LETTER A WITH CIRCUMFLEX
 ã  U+000E3 LATIN SMALL LETTER A WITH TILDE
 ä  U+000E4 LATIN SMALL LETTER A WITH DIAERESIS
 å  U+000E5 LATIN SMALL LETTER A WITH RING ABOVE
 ā  U+00101 LATIN SMALL LETTER A WITH MACRON
 ă  U+00103 LATIN SMALL LETTER A WITH BREVE
 ą  U+00105 LATIN SMALL LETTER A WITH OGONEK
 ǎ  U+001CE LATIN SMALL LETTER A WITH CARON
 ǟ  U+001DF LATIN SMALL LETTER A WITH DIAERESIS AND MACRON
 ǡ  U+001E1 LATIN SMALL LETTER A WITH DOT ABOVE AND MACRON
 ǻ  U+001FB LATIN SMALL LETTER A WITH RING ABOVE AND ACUTE
 ȁ  U+00201 LATIN SMALL LETTER A WITH DOUBLE GRAVE
 ȃ  U+00203 LATIN SMALL LETTER A WITH INVERTED BREVE
 ȧ  U+00227 LATIN SMALL LETTER A WITH DOT ABOVE
 ᶏ  U+01D8F LATIN SMALL LETTER A WITH RETROFLEX HOOK
 ◌ᷲ  U+01DF2 COMBINING LATIN SMALL LETTER A WITH DIAERESIS
 ḁ  U+01E01 LATIN SMALL LETTER A WITH RING BELOW
 ẚ  U+01E9A LATIN SMALL LETTER A WITH RIGHT HALF RING
 ạ  U+01EA1 LATIN SMALL LETTER A WITH DOT BELOW
 ả  U+01EA3 LATIN SMALL LETTER A WITH HOOK ABOVE
 ấ  U+01EA5 LATIN SMALL LETTER A WITH CIRCUMFLEX AND ACUTE
 ầ  U+01EA7 LATIN SMALL LETTER A WITH CIRCUMFLEX AND GRAVE
 ẩ  U+01EA9 LATIN SMALL LETTER A WITH CIRCUMFLEX AND HOOK ABOVE
 ẫ  U+01EAB LATIN SMALL LETTER A WITH CIRCUMFLEX AND TILDE
 ậ  U+01EAD LATIN SMALL LETTER A WITH CIRCUMFLEX AND DOT BELOW
 ắ  U+01EAF LATIN SMALL LETTER A WITH BREVE AND ACUTE
 ằ  U+01EB1 LATIN SMALL LETTER A WITH BREVE AND GRAVE
 ẳ  U+01EB3 LATIN SMALL LETTER A WITH BREVE AND HOOK ABOVE
 ẵ  U+01EB5 LATIN SMALL LETTER A WITH BREVE AND TILDE
 ặ  U+01EB7 LATIN SMALL LETTER A WITH BREVE AND DOT BELOW
 ⱥ  U+02C65 LATIN SMALL LETTER A WITH STROKE

来源：https://stackoverflow.com/questions/57169516/get-all-unicode-variations-of-a-latin-character

标签

python

python-3.x

unicode

unicode-normalization