Find characters that are similar glyphically in Unicode?

感情败类 2020-12-14 09:41

Let's say I have the characters Ú, Ù, Ü. All of them are glyphically similar to the English U.

Is there some list or algorithm to do this:

  • Given a Ú or
3 Answers
  • 2020-12-14 10:34

    It is very unclear what you are asking to do here.

    • There are characters whose canonical decompositions all start with the same base character: e, é, ê, ë, ē, ĕ, ė, ę, ě, ȅ, ȇ, ȩ, ḕ, ḗ, ḙ, ḛ, ḝ, ẹ, ẻ, ẽ, ế, ề, ể, ễ, ệ, e̳, … or s, ś, ŝ, ş, š, ș, ṡ, ṣ, ṥ, ṧ, ṩ, ….

    • There are characters whose compatibility decompositions all include a particular character: ᵉ, ₑ, ℯ, ⅇ, ⒠, ⓔ, ㋍, ㋎, ｅ, … or s, ſ, ˢ, ẛ, ₨, ℁, ⒮, ⓢ, ㎧, ㎨, ㎮, ㎯, ㎰, ㎱, ㎲, ㎳, ㏛, ﬅ, ﬆ, ｓ, … or R, ᴿ, ₨, ℛ, ℜ, ℝ, Ⓡ, ㏚, Ｒ, ….

    • There are characters that just happen to look alike in some fonts: ß and β and ϐ, or 3 and Ʒ and Ȝ and ȝ and ʒ and ӡ and ᴣ, or ɣ and ɤ and γ, or F and Ϝ and ϝ, or B and Β and В, or ∅ and ○ and 0 and O and ০ and ੦ and ౦ and ૦, or 1 and l and I and Ⅰ and ᛁ and | and ǀ and ∣, ….

    • Characters that are the same case-insensitively, like s and S and ſ, or ss and Ss and SS and ß and ẞ, ….

    • Characters that all have the same numeric value, like all these for the value 1: 1¹١۱߁१১੧૧୧௧౧౹౼೧൧๑໑༡၁႑፩១៱᠑᥇᧑᧚᪁᪑᭑᮱᱁᱑₁⅟ ① ⑴ ⒈ ⓵ ❶➀➊꘡꣑꤁꧑꩑꯱
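
    Most of the categories above can be probed with Python's standard `unicodedata` module. The following is a minimal sketch, not a complete classifier; the helper names are made up for illustration. The "look alike in some fonts" category is not covered, because it is not derivable from normalization data (the Unicode Consortium publishes a separate confusables list for that purpose).

```python
import unicodedata as ud

def canonical_base(ch):
    """First code point of the canonical (NFD) decomposition."""
    return ud.normalize('NFD', ch)[0]

def compatibility_form(ch):
    """Full compatibility (NFKD) decomposition."""
    return ud.normalize('NFKD', ch)

def same_ignoring_case(a, b):
    """Compare two strings using full Unicode case folding."""
    return a.casefold() == b.casefold()

def numeric_value(ch):
    """Numeric value of a character, or None if it has none."""
    return ud.numeric(ch, None)

print(canonical_base('é'), canonical_base('ṩ'))          # e s
print(compatibility_form('ⓔ'), compatibility_form('ſ'))  # e s
print(same_ignoring_case('ß', 'SS'))                      # True
print(numeric_value('①'), numeric_value('१'))             # 1.0 1.0
```

    Which of these notions of "similar" is appropriate depends on the use case: canonical decomposition groups accented letters, compatibility decomposition groups stylistic variants, and case folding groups case variants.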

  • 2020-12-14 10:35

    Why not just compare glyphs with something like this?

    package similarglyphcharacterdetector;
    
    import java.awt.Color;
    import java.awt.Font;
    import java.awt.Graphics2D;
    import java.awt.Rectangle;
    import java.awt.font.FontRenderContext;
    import java.awt.image.BufferedImage;
    import java.util.HashMap;
    import java.util.LinkedHashMap;
    import java.util.Map;
    
    public class SimilarGlyphCharacterDetector {
    
        static char[] TEST_CHARS = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz1234567890".toCharArray();
        static BufferedImage[] SAMPLES = null;
    
        public static BufferedImage drawGlyph(Font font, String string) {
            FontRenderContext frc = ((Graphics2D) new BufferedImage(1, 1, BufferedImage.TYPE_BYTE_GRAY).getGraphics()).getFontRenderContext();
    
            Rectangle r = font.getMaxCharBounds(frc).getBounds();
    
            BufferedImage res = new BufferedImage(r.width, r.height, BufferedImage.TYPE_BYTE_GRAY);
            Graphics2D g = (Graphics2D) res.getGraphics();
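            // fillRect paints with the current paint, not the background color,
            // so set the paint to white explicitly before filling.
            g.setPaint(Color.WHITE);
            g.fillRect(0, 0, r.width, r.height);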
            g.setPaint(Color.BLACK);
            g.setFont(font);
            g.drawString(string, 0, r.height - font.getLineMetrics(string, g.getFontRenderContext()).getDescent());
            return res;
        }
    
        private static void drawSamples(Font f) {
            SAMPLES = new BufferedImage[TEST_CHARS.length];
            for (int i = 0; i < TEST_CHARS.length; i++)
                SAMPLES[i] = drawGlyph(f, String.valueOf(TEST_CHARS[i]));
        }
    
        private static int compareImages(BufferedImage img1, BufferedImage img2) {
            if (img1.getWidth() != img2.getWidth() || img1.getHeight() != img2.getHeight())
                throw new IllegalArgumentException();
            int d = 0;
            for (int y = 0; y < img1.getHeight(); y++) {
                for (int x = 0; x < img1.getWidth(); x++) {
                    if (img1.getRGB(x, y) != img2.getRGB(x, y))
                        d++;
                }
            }
            return d;
        }
    
        private static int nearestSampleIndex(BufferedImage image, int maxDistance) {
            int best = Integer.MAX_VALUE;
            int bestIdx = -1;
            for (int i = 0; i < SAMPLES.length; i++) {
                int diff = compareImages(image, SAMPLES[i]);
                if (diff < best) {
                    best = diff;
                    bestIdx = i;
                }
            }
            if (best > maxDistance)
                return -1;
            return bestIdx;
        }
    
        public static void main(String[] args) throws Exception {
            Font f = new Font("FreeMono", Font.PLAIN, 13);
            drawSamples(f);
            Map<Character, StringBuilder> res = new LinkedHashMap<>();
            for (char c : TEST_CHARS)
                res.put(c, new StringBuilder(String.valueOf(c)));
            int maxDistance = 5;
            for (int i = 0x80; i <= 0xFFFF; i++) {
                char c = (char)i;
                if (f.canDisplay(c)) {
                    int n = nearestSampleIndex(drawGlyph(f, String.valueOf(c)), maxDistance);
                    if (n != -1) {
                        char nc = TEST_CHARS[n];
                        res.get(nc).append(c);
                    }
                }
            }
            for (Map.Entry<Character, StringBuilder> entry : res.entrySet())
                if (entry.getValue().length() > 1)
                    System.out.println(entry.getValue());
        }
    }
    

    Output:

    AÀÁÂÃÄÅĀĂĄǍǞȀȦΆΑΛАѦӒẠẢἈἉᾸᾹᾺᾼ₳Å
    BƁƂΒБВЬḂḄḆ
    CĆĈĊČƇΓЄГСὉℂⅭ
    ...
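
    The matching logic in the program above (count differing pixels, pick the closest sample, reject it if it is further than a threshold) can be sketched independently of any font rendering. The bitmaps below are hypothetical 3x3 toy data, not real glyphs, and the function names are invented for this illustration.

```python
def pixel_distance(a, b):
    """Count positions where two equally sized bitmaps differ."""
    if len(a) != len(b) or any(len(r1) != len(r2) for r1, r2 in zip(a, b)):
        raise ValueError("bitmaps must have the same dimensions")
    return sum(p != q for r1, r2 in zip(a, b) for p, q in zip(r1, r2))

def nearest_sample(bitmap, samples, max_distance):
    """Return the key of the closest sample, or None if it is too far away."""
    best_key, best = None, None
    for key, sample in samples.items():
        d = pixel_distance(bitmap, sample)
        if best is None or d < best:
            best_key, best = key, d
    if best is None or best > max_distance:
        return None
    return best_key

# Hypothetical 3x3 bitmaps ('#' = ink, '.' = background), not real font renderings.
SAMPLES = {
    'I': [".#.", ".#.", ".#."],
    'L': ["#..", "#..", "###"],
}
candidate = [".#.", ".#.", "##."]  # differs from 'I' in exactly one pixel

print(nearest_sample(candidate, SAMPLES, 1))  # I
print(nearest_sample(candidate, SAMPLES, 0))  # None
```

    As with the Java version, the result depends entirely on the font, the size, and the threshold chosen, so the groups found for one font need not hold for another.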
    
  • 2020-12-14 10:40

    This won't work in every case, but one way to get rid of most accents is to convert the characters to their decomposed form (NFD) and then discard the combining marks:

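    import unicodedata as ud
    s = 'U, Ù, Ú, Û, Ü, Ũ, Ū, Ŭ, Ů, Ű, Ų, Ư, Ǔ, Ǖ, Ǘ, Ǚ, Ǜ, Ụ, Ủ, Ứ, Ừ, Ử, Ữ, Ự'
    # NFD splits each letter into a base letter plus combining marks; encoding to
    # ASCII with 'ignore' then drops everything non-ASCII, including the marks.
    print(ud.normalize('NFD', s).encode('ascii', 'ignore').decode('ascii'))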
    

    Output:

    U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U
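
    One case where the ASCII-encoding trick falls short: letters such as Ø, Đ, or ł have no canonical decomposition, and non-Latin text is not ASCII at all, so `encode('ascii', 'ignore')` silently deletes them. A variant that removes only the combining marks and leaves everything else intact might look like the sketch below; `strip_marks` is a name chosen here for illustration.

```python
import unicodedata as ud

def strip_marks(text):
    """Decompose to NFD, drop combining marks, and recompose to NFC."""
    decomposed = ud.normalize('NFD', text)
    # unicodedata.combining() returns a nonzero combining class for accents.
    kept = ''.join(ch for ch in decomposed if not ud.combining(ch))
    return ud.normalize('NFC', kept)

print(strip_marks('Ù Ú Û Ü Ǘ Ự'))  # U U U U U U
print(strip_marks('Ø Đ ł'))        # Ø Đ ł  (no decomposition, left unchanged)
print(strip_marks('Ελλάδα'))       # Ελλαδα (non-Latin letters survive)
```

    Mapping letters like Ø to O would need an explicit substitution table, since Unicode normalization alone does not relate them.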
    

    To find accented characters, use something like:

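    import unicodedata as ud
    import string
    
    def asc(ch):
        # Decompose, then keep only the ASCII part (the base letter, if any).
        return ud.normalize('NFD', ch).encode('ascii', 'ignore').decode('ascii')
    
    # Every code point in the Basic Multilingual Plane.
    U = ''.join(chr(i) for i in range(65536))
    for c in string.ascii_letters:
        print(''.join(u for u in U if asc(u) == c))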
    

    Output:

    aàáâãäåāăąǎǟǡǻȁȃȧḁạảấầẩẫậắằẳẵặ
    bḃḅḇ
    cçćĉċčḉ
    dďḋḍḏḑḓ
    eèéêëēĕėęěȅȇȩḕḗḙḛḝẹẻẽếềểễệ
    fḟ
     :
    etc.
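
    The loop above rescans all 65536 code points once per letter. If the lookup is needed repeatedly, a single pass that builds a reverse index may be preferable. This is a sketch under assumptions: `build_variant_index` is a hypothetical helper, and unlike the loop above it keeps only characters that decompose into an ASCII letter followed purely by combining marks.

```python
import unicodedata as ud
from collections import defaultdict

def build_variant_index(limit=0x10000):
    """Map each ASCII letter to the code points below `limit` that
    decompose (NFD) into that letter plus combining marks."""
    index = defaultdict(list)
    for cp in range(0x80, limit):
        ch = chr(cp)
        nfd = ud.normalize('NFD', ch)
        base = nfd[0]
        if (len(nfd) > 1 and base.isascii() and base.isalpha()
                and all(ud.combining(m) for m in nfd[1:])):
            index[base].append(ch)
    return index

variants = build_variant_index()
print(''.join(variants['U']))
```

    Given an accented character such as Ú, its siblings are then `variants[ud.normalize('NFD', 'Ú')[0]]`. The exact contents depend on the Unicode version bundled with the Python interpreter.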
    