How to determine character similarity?

Submitted by ぃ、小莉子 on 2019-11-30 23:45:18

If you're looking for a table that will let you calculate a 'replacement cost' of sorts based on visual similarity, I've been searching for such a thing for a while with little success, so I started looking at it as a new problem. I'm not working with OCR, but I am looking for a way to limit the search parameters in a probabilistic search for mis-typed characters. Since they are mis-typed because a human has confused the characters visually, the same principle should apply to you.

My approach was to categorize letters based on their stroke components in an 8-bit field. The bits are, from left to right:

7: Left Vertical
6: Center Vertical
5: Right Vertical
4: Top Horizontal
3: Middle Horizontal
2: Bottom Horizontal
1: Top-left to bottom-right stroke
0: Bottom-left to top-right stroke

For lower-case characters, descenders on the left are recorded in bit 1, and descenders on the right in bit 0, as diagonals.

With that scheme, I came up with the following values which attempt to rank the characters according to visual similarity.

m:               11110000: F0
g:               10111101: BD
S,B,G,a,e,s:     10111100: BC
R,p:             10111010: BA
q:               10111001: B9
P:               10111000: B8
Q:               10110110: B6
D,O,o:           10110100: B4
n:               10110000: B0
b,h,d:           10101100: AC
H:               10101000: A8
U,u:             10100100: A4
M,W,w:           10100011: A3
N:               10100010: A2
E:               10011100: 9C
F,f:             10011000: 98
C,c:             10010100: 94
r:               10010000: 90
L:               10000100: 84
K,k:             10000011: 83
T:               01010000: 50
t:               01001000: 48
J,j:             01000100: 44
Y:               01000011: 43
I,l,i:           01000000: 40
Z,z:             00010101: 15
A:               00001011: 0B
y:               00000101: 05
V,v,X,x:         00000011: 03
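One way to turn these codes into a replacement cost is to XOR the two bit fields and count the differing bits, so characters sharing more stroke components are cheaper to substitute. The table values below are copied from the ranking above; the popcount-of-XOR cost function is one possible interpretation, not something the original scheme prescribes:

```python
# Stroke-component codes from the table above (hex values as listed).
STROKE_CODE = {
    'm': 0xF0, 'g': 0xBD, 'S': 0xBC, 'B': 0xBC, 'G': 0xBC, 'a': 0xBC,
    'e': 0xBC, 's': 0xBC, 'R': 0xBA, 'p': 0xBA, 'q': 0xB9, 'P': 0xB8,
    'Q': 0xB6, 'D': 0xB4, 'O': 0xB4, 'o': 0xB4, 'n': 0xB0, 'b': 0xAC,
    'h': 0xAC, 'd': 0xAC, 'H': 0xA8, 'U': 0xA4, 'u': 0xA4, 'M': 0xA3,
    'W': 0xA3, 'w': 0xA3, 'N': 0xA2, 'E': 0x9C, 'F': 0x98, 'f': 0x98,
    'C': 0x94, 'c': 0x94, 'r': 0x90, 'L': 0x84, 'K': 0x83, 'k': 0x83,
    'T': 0x50, 't': 0x48, 'J': 0x44, 'j': 0x44, 'Y': 0x43, 'I': 0x40,
    'l': 0x40, 'i': 0x40, 'Z': 0x15, 'z': 0x15, 'A': 0x0B, 'y': 0x05,
    'V': 0x03, 'v': 0x03, 'X': 0x03, 'x': 0x03,
}

def stroke_distance(a: str, b: str) -> int:
    """Count stroke components present in one character but not the other."""
    return bin(STROKE_CODE[a] ^ STROKE_CODE[b]).count('1')
```

With this, characters sharing a code (like D, O, o) have distance 0, and O vs. Q differ by a single stroke bit.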

This, as it stands, is too primitive for my purposes and requires more work. You may be able to use it, however, or perhaps adapt it to suit your purposes. The scheme is fairly simple. This ranking is for a monospace font; if you are using a sans-serif font, you will likely have to rework the values.

This table is a hybrid table including all characters, lower- and upper-case, but if you split it into an upper-case-only table and a lower-case-only table it might prove more effective, and that would also allow you to apply specific casing penalties.

Keep in mind that this is early experimentation. If you see a way to improve it (for example by changing the bit-sequencing) by all means feel free to do so.

In general I've seen Damerau-Levenshtein used much more often than plain Levenshtein; it adds the transposition operation, which is supposed to account for more than 80% of human misspellings, so you should certainly consider it.
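For reference, a minimal sketch of the restricted Damerau-Levenshtein variant (optimal string alignment): standard Levenshtein plus the transposition check. The function name and structure are illustrative:

```python
def damerau_levenshtein(a: str, b: str) -> int:
    """Edit distance allowing insert, delete, substitute, and
    transposition of adjacent characters (restricted/OSA variant)."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            # Transposition of two adjacent characters.
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[m][n]
```

The typical human typo "teh" for "the" scores 1 here (one transposition), where plain Levenshtein would charge 2.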

As to your specific problem, you could easily modify the algorithm to increase the cost when substituting a capital letter with a non-capital letter, and vice versa, to obtain something like this:

dist(Co, CY) = 2
dist(Co, CZ) = 2
dist(Co, Ca) = 1
Andrew Tomazos

So in your distance function just have a different cost for replacing different pairs of characters.

That is, rather than a replacement adding a set cost of one or two irrespective of the characters involved, have a replace-cost function that returns something between 0.0 and 2.0 for the cost of replacing certain characters in certain contexts.

At each step of the memoization, just call this cost function:

cost[x][y] = min(
    cost[x-1][y] + 1,                            // insert
    cost[x][y-1] + 1,                            // delete
    cost[x-1][y-1] + cost_to_replace(a[x], b[y]) // replace
);
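Putting the pieces together, here is a runnable sketch of that weighted edit distance (in Python rather than the C++ of the linked implementation). The specific penalty values in `cost_to_replace` are illustrative assumptions chosen to reproduce the `dist(Co, ...)` examples above:

```python
def cost_to_replace(c1: str, c2: str) -> float:
    """Illustrative substitution cost in [0.0, 2.0]:
    0.0 identical, 0.5 case-only difference, 1.0 different letter
    in the same case, 2.0 different letter in a different case."""
    if c1 == c2:
        return 0.0
    if c1.lower() == c2.lower():
        return 0.5
    if c1.islower() == c2.islower():
        return 1.0
    return 2.0

def weighted_edit_distance(a: str, b: str) -> float:
    m, n = len(a), len(b)
    cost = [[0.0] * (n + 1) for _ in range(m + 1)]
    for x in range(m + 1):
        cost[x][0] = float(x)
    for y in range(n + 1):
        cost[0][y] = float(y)
    for x in range(1, m + 1):
        for y in range(1, n + 1):
            cost[x][y] = min(
                cost[x - 1][y] + 1,    # insert
                cost[x][y - 1] + 1,    # delete
                cost[x - 1][y - 1] + cost_to_replace(a[x - 1], b[y - 1]),
            )
    return cost[m][n]
```

This reproduces the earlier examples: replacing `o` with `a` (same case) costs 1, while replacing it with `Y` or `Z` (different case) costs 2.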

Here is my full Edit Distance implementation, just swap the replace_cost constant for a replace_cost function as shown:

https://codereview.stackexchange.com/questions/10130/edit-distance-between-two-strings

In terms of implementing the cost_to_replace function, you need a matrix of characters with costs based on how similar the characters are. There may be such a table floating around, or you could build it yourself by rendering each pair of characters to a pair of images and then comparing the images for similarity using standard vision techniques.

Alternatively you could use a supervised method whereby you correct several OCR misreads and note the occurrences in a table that then becomes the above cost table (i.e. if the OCR gets it wrong, the characters must be similar).
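The supervised idea can be sketched as a simple tally: count which (read, intended) character pairs your corrections produce, then map relative frequency onto a cost in [0.0, 2.0], so frequently confused pairs become cheap to substitute. The training pairs and the linear cost formula here are illustrative assumptions:

```python
from collections import Counter

def build_cost_table(corrections):
    """corrections: iterable of (ocr_char, true_char) pairs from
    hand-corrected OCR output. Returns {pair: cost}, where the most
    frequently confused pair gets cost 0.0 and rare pairs approach 2.0."""
    counts = Counter(corrections)
    most_common = max(counts.values())
    return {pair: 2.0 * (1 - n / most_common) for pair, n in counts.items()}

# Hypothetical correction log: '0'<->'O' confused often, '5'<->'S' rarely.
costs = build_cost_table([('0', 'O'), ('0', 'O'), ('0', 'O'),
                          ('1', 'l'), ('1', 'l'), ('5', 'S')])
```

Unseen pairs would fall back to the default replacement cost; with more data you would likely smooth the counts rather than scale linearly.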
