How to remove OCR artifacts from text?

Submitted by ⅰ亾dé卋堺 on 2020-01-13 11:29:10

Question


OCR-generated texts sometimes come with artifacts such as this one:

Diese grundsätzliche V e r b o r g e n h e i t Gottes, die sich n u r dem N a c h f o l g e r ö f f n e t , ist m i t d e m Messiasgeheimnis gemeint

While it is not unusual for letter spacing to be used for emphasis (probably due to early printing-press limitations), it is unfavorable for retrieval tasks.

How can one turn the above text into a more canonical form, like:

Diese grundsätzliche Verborgenheit Gottes, die sich nur dem Nachfolger öffnet, ist mit dem Messiasgeheimnis gemeint

Can this be done efficiently for large amounts of text?


One idea would be to concatenate the whole string (to skip guessing where the word boundaries are) and then run a text-segmentation algorithm on it, perhaps something similar to this: http://norvig.com/ngrams/
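
A minimal sketch of the concatenation step of that idea in Python (the segmentation itself is the hard part, and is what the answer below addresses):

    # Strip all whitespace so that word boundaries no longer have to be guessed;
    # the resulting blob is then handed to a word-segmentation routine.
    def concatenate(text):
        return "".join(text.split())

    print(concatenate(
        "Diese grundsätzliche V e r b o r g e n h e i t Gottes, "
        "die sich n u r dem N a c h f o l g e r ö f f n e t ,"
    ))
    # DiesegrundsätzlicheVerborgenheitGottes,diesichnurdemNachfolgeröffnet,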


Answer 1:


If you have a dictionary for the target language, and every spaced-out run corresponds to just a single word, then it's easy: just scan through the text looking for maximal-length runs of spaced-out single letters, and replace each run with the corresponding dictionary word if it exists (and otherwise leave it unchanged).
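
Here is a sketch of that scan-and-replace pass in Python; the regular expression and the tiny stand-in word list are my own assumptions, and in practice you would load a full dictionary for the target language:

    import re

    # Stand-in word list; replace with a real dictionary for the target language.
    DICTIONARY = {"verborgenheit", "nur", "nachfolger"}

    # A maximal run of two or more single letters separated by single spaces.
    SPACED_RUN = re.compile(r"\b(?:[^\W\d_] )+[^\W\d_]\b")

    def join_single_words(text, dictionary=DICTIONARY):
        """Join a spaced-out run only if the joined form is a known word."""
        def repl(match):
            joined = match.group(0).replace(" ", "")
            return joined if joined.lower() in dictionary else match.group(0)
        return SPACED_RUN.sub(repl, text)

    print(join_single_words("Diese grundsätzliche V e r b o r g e n h e i t Gottes"))
    # Diese grundsätzliche Verborgenheit Gottes

Note that a run spanning two words, such as m i t d e m, joins to mitdem, which is not in the dictionary, so this pass leaves it untouched.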

The only real difficulty is with runs like m i t d e m that correspond to two or more separate words. A simple approach would be to greedily "nibble off" prefixes that appear in the dictionary, but this can give suboptimal results: in particular, you may be left with a suffix that matches no dictionary string even though a different choice of breakpoints would have worked (e.g. b e i m A r z t fails if you greedily grab bei instead of beim from the front). Fortunately there is a simple DP approach (linear-time if the maximum word length is bounded) that does a better job, and it can even incorporate per-word weights, which helps pick the most likely decomposition when there is more than one.

Given a string S[1 .. n] (with spaces removed), we compute f(i), the score of the best decomposition of the length-i prefix of S, for all 1 <= i <= n:

f(0) = 0
f(i) = max over all 0 <= j < i of f(j) + dictScore(S[j+1 .. i])

f(n) will then be the score of the best possible decomposition of the entire string. If you set dictScore(T) to 1 for words that exist in the dictionary and 0 for words that don't, you will get a decomposition into as many words as possible; if you set dictScore(T) to, e.g., -1 for words that exist in the dictionary and -2 for words that don't, you'll get a decomposition into as few words as possible. You can also choose to award higher scores for more "likely" words.

After computing these scores, you can walk back through the DP matrix to reconstruct a decomposition that corresponds to the maximal score.
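
For concreteness, here is a sketch of that DP in Python. The names segment, dict_score and the max_word_len cutoff are my own choices, not part of the original answer; the scoring gives +1 per dictionary word and an arbitrary heavy penalty for unknown chunks, and backpointers record the best split so the decomposition can be recovered at the end:

    def segment(s, dictionary, max_word_len=30):
        """Split the space-stripped string s into its best-scoring word sequence.

        f[i] holds the best score for a decomposition of s[:i]; back[i] records
        where the last word of that decomposition starts.
        """
        def dict_score(word):
            # +1 per dictionary word, heavy (arbitrary) penalty for unknown chunks.
            return 1.0 if word.lower() in dictionary else -10.0

        n = len(s)
        f = [float("-inf")] * (n + 1)
        back = [0] * (n + 1)
        f[0] = 0.0
        for i in range(1, n + 1):
            # Bounding the candidate word length keeps the loop roughly linear in n.
            for j in range(max(0, i - max_word_len), i):
                word = s[j:i]  # corresponds to S[j+1 .. i] in the recurrence above
                score = f[j] + dict_score(word)
                if score > f[i]:
                    f[i] = score
                    back[i] = j

        # Walk the backpointers from the end to recover the decomposition.
        words = []
        i = n
        while i > 0:
            words.append(s[back[i]:i])
            i = back[i]
        return list(reversed(words))

    print(segment("mitdem", {"mit", "dem", "bei", "beim", "arzt"}))    # ['mit', 'dem']
    print(segment("beimarzt", {"mit", "dem", "bei", "beim", "arzt"}))  # ['beim', 'arzt']

Swapping dict_score for word-frequency-based weights (as in the Norvig chapter linked in the question) gives the weighted variant described above.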



Source: https://stackoverflow.com/questions/31136692/how-to-remove-ocr-artifacts-from-text
