Let's say I have the characters Ú, Ù, and Ü. All of them are glyphically similar to the English U.
Is there some list or algorithm to do this:
It is very unclear what you are asking to do here.
There are characters whose canonical decompositions all start with the same base character: e, é, ê, ë, ē, ĕ, ė, ę, ě, ȅ, ȇ, ȩ, ḕ, ḗ, ḙ, ḛ, ḝ, ẹ, ẻ, ẽ, ế, ề, ể, ễ, ệ, e̳, … or s, ś, ŝ, ş, š, ș, ṡ, ṣ, ṥ, ṧ, ṩ, ….
There are characters whose compatibility decompositions all include a particular character: ᵉ, ₑ, ℯ, ⅇ, ⒠, ⓔ, ㋍, ㋎, e, … or s, ſ, ˢ, ẛ, ₨, ℁, ⒮, ⓢ, ㎧, ㎨, ㎮, ㎯, ㎰, ㎱, ㎲, ㎳, ㏛, ſt, st, s, … or R, ᴿ, ₨, ℛ, ℜ, ℝ, Ⓡ, ㏚, R, ….
There are characters that just happen to look alike in some fonts: ß and β and ϐ, or 3 and Ʒ and Ȝ and ȝ and ʒ and ӡ and ᴣ, or ɣ and ɤ and γ, or F and Ϝ and ϝ, or B and Β and В, or ∅ and ○ and 0 and O and ০ and ੦ and ౦ and ૦, or 1 and l and I and Ⅰ and ᛁ and | and ǀ and ∣, ….
There are characters that are the same case-insensitively, like s and S and ſ, or ss and Ss and SS and ß and ẞ, ….
And there are characters that all have the same numeric value, like all these for the value 1: 1¹١۱߁१১੧૧୧௧౧౹౼೧൧๑໑༡၁႑፩១៱᠑᥇᧑᧚᪁᪑᭑᮱᱁᱑₁⅟ ① ⑴ ⒈ ⓵ ❶➀➊꘡꣑꤁꧑꩑꯱
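Several of these relationships can be queried from the Unicode Character Database. The following is only a minimal Python 3 sketch using the standard `unicodedata` module; the helper names are made up for illustration. It does not cover the "look alike in some fonts" case, which has no equivalent in that module.

```python
import unicodedata as ud

def canonical_base(ch):
    """Return the first code point of the canonical (NFD) decomposition."""
    return ud.normalize('NFD', ch)[0]

def compatibility_form(ch):
    """Return the compatibility (NFKD) decomposition of a character."""
    return ud.normalize('NFKD', ch)

def same_ignoring_case(a, b):
    """Compare two strings after full Unicode case folding."""
    return a.casefold() == b.casefold()

def numeric_value(ch):
    """Return the numeric value of a character, or None if it has none."""
    return ud.numeric(ch, None)

print(canonical_base('ệ'))            # e
print(compatibility_form('ⓔ'))        # e
print(same_ignoring_case('ß', 'SS'))  # True
print(numeric_value('①'))             # 1.0
```

Which of these helpers is the right one depends entirely on which notion of "similar" is meant.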
Why not just compare glyphs with something like this?
package similarglyphcharacterdetector;

import java.awt.Color;
import java.awt.Font;
import java.awt.Graphics2D;
import java.awt.Rectangle;
import java.awt.font.FontRenderContext;
import java.awt.image.BufferedImage;
import java.util.LinkedHashMap;
import java.util.Map;

public class SimilarGlyphCharacterDetector {

    // Reference characters that every other glyph is compared against.
    static char[] TEST_CHARS = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz1234567890".toCharArray();
    static BufferedImage[] SAMPLES = null;

    // Renders a string into a grayscale image sized to the font's largest glyph.
    public static BufferedImage drawGlyph(Font font, String string) {
        FontRenderContext frc = ((Graphics2D) new BufferedImage(1, 1, BufferedImage.TYPE_BYTE_GRAY).getGraphics()).getFontRenderContext();
        Rectangle r = font.getMaxCharBounds(frc).getBounds();
        BufferedImage res = new BufferedImage(r.width, r.height, BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = (Graphics2D) res.getGraphics();
        // fillRect uses the current paint (not the background color), so set it explicitly.
        g.setPaint(Color.WHITE);
        g.fillRect(0, 0, r.width, r.height);
        g.setPaint(Color.BLACK);
        g.setFont(font);
        g.drawString(string, 0, r.height - font.getLineMetrics(string, g.getFontRenderContext()).getDescent());
        g.dispose();
        return res;
    }

    // Pre-renders one image per reference character.
    private static void drawSamples(Font f) {
        SAMPLES = new BufferedImage[TEST_CHARS.length];
        for (int i = 0; i < TEST_CHARS.length; i++)
            SAMPLES[i] = drawGlyph(f, String.valueOf(TEST_CHARS[i]));
    }

    // Counts the pixels that differ between two images of identical size.
    private static int compareImages(BufferedImage img1, BufferedImage img2) {
        if (img1.getWidth() != img2.getWidth() || img1.getHeight() != img2.getHeight())
            throw new IllegalArgumentException();
        int d = 0;
        for (int y = 0; y < img1.getHeight(); y++) {
            for (int x = 0; x < img1.getWidth(); x++) {
                if (img1.getRGB(x, y) != img2.getRGB(x, y))
                    d++;
            }
        }
        return d;
    }

    // Returns the index of the closest sample, or -1 if none is within maxDistance pixels.
    private static int nearestSampleIndex(BufferedImage image, int maxDistance) {
        int best = Integer.MAX_VALUE;
        int bestIdx = -1;
        for (int i = 0; i < SAMPLES.length; i++) {
            int diff = compareImages(image, SAMPLES[i]);
            if (diff < best) {
                best = diff;
                bestIdx = i;
            }
        }
        if (best > maxDistance)
            return -1;
        return bestIdx;
    }

    public static void main(String[] args) throws Exception {
        Font f = new Font("FreeMono", Font.PLAIN, 13);
        drawSamples(f);
        // LinkedHashMap keeps the groups in TEST_CHARS order.
        Map<Character, StringBuilder> res = new LinkedHashMap<>();
        for (char c : TEST_CHARS)
            res.put(c, new StringBuilder(String.valueOf(c)));
        int maxDistance = 5;
        // Scan the non-ASCII part of the Basic Multilingual Plane.
        for (int i = 0x80; i <= 0xFFFF; i++) {
            char c = (char) i;
            if (f.canDisplay(c)) {
                int n = nearestSampleIndex(drawGlyph(f, String.valueOf(c)), maxDistance);
                if (n != -1) {
                    char nc = TEST_CHARS[n];
                    res.get(nc).append(c);
                }
            }
        }
        // Print only the reference characters that gained at least one look-alike.
        for (Map.Entry<Character, StringBuilder> entry : res.entrySet())
            if (entry.getValue().length() > 1)
                System.out.println(entry.getValue());
    }
}
Output:
AÀÁÂÃÄÅĀĂĄǍǞȀȦΆΑΛАѦӒẠẢἈἉᾸᾹᾺᾼ₳Å
BƁƂΒБВЬḂḄḆ
CĆĈĊČƇΓЄГСὉℂⅭ
...
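This won't work in every case, but one way to get rid of most accents is to convert the characters to their decomposed form (NFD) and then throw away the combining accents. Letters that have no canonical decomposition, such as Ø, are dropped entirely by this approach: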
import unicodedata as ud

s = 'U, Ù, Ú, Û, Ü, Ũ, Ū, Ŭ, Ů, Ű, Ų, Ư, Ǔ, Ǖ, Ǘ, Ǚ, Ǜ, Ụ, Ủ, Ứ, Ừ, Ử, Ữ, Ự'
# Decompose, then drop every non-ASCII code point (this removes the combining marks).
print(ud.normalize('NFD', s).encode('ascii', 'ignore').decode('ascii'))
U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U
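The `encode('ascii', 'ignore')` step above discards every non-ASCII character, not just the accents. A gentler variant, sketched here for Python 3 under the assumption that only combining marks should go, filters on `unicodedata.combining()` instead. The function name `strip_marks` is made up for this example.

```python
import unicodedata as ud

def strip_marks(text):
    """Decompose to NFD and drop code points with a nonzero combining class."""
    decomposed = ud.normalize('NFD', text)
    return ''.join(ch for ch in decomposed if not ud.combining(ch))

print(strip_marks('Ù Ú Û Ü Ứ'))  # U U U U U
print(strip_marks('Øre'))        # Øre (Ø has no decomposition, so it is kept)
```

Whether keeping characters like Ø is desirable depends on the use case. A few marks have combining class 0 and would survive this filter.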
To find the accented characters for each letter, use something like:
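import unicodedata as ud
import string

def asc(ch):
    # Decompose, then keep only the ASCII part (the base letter, if any).
    return ud.normalize('NFD', ch).encode('ascii', 'ignore').decode('ascii')

# Every code point in the Basic Multilingual Plane.
U = ''.join(chr(i) for i in range(65536))
for c in string.ascii_letters:
    print(''.join(u for u in U if asc(u) == c))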
aàáâãäåāăąǎǟǡǻȁȃȧḁạảấầẩẫậắằẳẵặ
bḃḅḇ
cçćĉċčḉ
dďḋḍḏḑḓ
eèéêëēĕėęěȅȇȩḕḗḙḛḝẹẻẽếềểễệ
fḟ
:
etc.
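The loop above rescans all 65,536 code points once per letter. A possible alternative, sketched for Python 3, makes a single pass and collects the groups in a dictionary. `build_groups` and its `limit` parameter are names invented for this example, and the grouping rule is the same decompose-and-drop-non-ASCII test used above.

```python
import string
import unicodedata as ud
from collections import defaultdict

def build_groups(limit=0x10000):
    """Map each ASCII letter to the characters whose NFD form reduces to it."""
    letters = set(string.ascii_letters)
    groups = defaultdict(list)
    for cp in range(limit):
        ch = chr(cp)
        # Same reduction as before: decompose, then keep only ASCII.
        reduced = ud.normalize('NFD', ch).encode('ascii', 'ignore').decode('ascii')
        if reduced in letters:
            groups[reduced].append(ch)
    return groups

groups = build_groups()
for letter in 'abc':
    print(''.join(groups[letter]))
```

The exact members of each group depend on the Unicode version bundled with the Python interpreter, so the printed lines may be longer than the ones shown above.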