Question
I need to automatically remove the mildly colored background of a scanned document image for OCR.
ScanTailor is an open source C++ GUI-based app that does background separation among other things, but I cannot figure out how to run only the last step which actually removes the background.
Ideally, I could find the code that does this and either:
- Port that part to C#
- Modify the C++ to respond to command line execution, only performing that step on a given image
Can you help me understand how I can do either?
Or do you know of other libraries that can do this? (Any language or platform is acceptable.)
Answer 1:
You are referring to Thresholding, Despeckling and Noise Removal techniques which are necessary in OCR applications.
The quality of the results depends very much on many different factors:
- Print quality of the original
- Scan quality
- Image resolution
- Background colours and patterns used
- Noise and other marks
You may find the IEvolution.NET library at http://www.hi-components.com/nievolution.asp useful. It has many image processing functions to play with.
There are many commercial engines available, but there is no one perfect function that solves every image processing problem. You must adapt the functions and parameters to match your images. See, for example, http://www.recogniform.com/thresholding.htm
- Best threshold for converting grayscale to black and white
- Adaptive threshold binarization: post-processing for removing ghost objects
- Adaptive threshold Binarization's bad effects
- fast threshold and bit packing algorithm ( possible improvements ? )
A Google search will turn up lots of results.
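To make the thresholding step concrete, here is a minimal sketch of one common global method, Otsu's method, in plain Python. This is not taken from ScanTailor or from any of the libraries mentioned above; the function names `otsu_threshold` and `binarize` are made up for illustration, and the image is assumed to be a list of rows of 0..255 grayscale values. Pages with uneven lighting usually need an adaptive (local) threshold instead.

```python
def otsu_threshold(gray):
    """Return a global threshold for a 2-D list of 0..255 gray values.

    Picks the threshold that maximises the between-class variance of the
    dark class (values <= t) and the light class (values > t).
    """
    hist = [0] * 256
    for row in gray:
        for v in row:
            hist[v] += 1

    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_lo = 0    # pixel count of the dark class so far
    sum_lo = 0  # sum of gray values in the dark class so far
    for t in range(256):
        w_lo += hist[t]
        sum_lo += t * hist[t]
        if w_lo == 0:
            continue            # no dark pixels yet
        w_hi = total - w_lo
        if w_hi == 0:
            break               # every pixel is already in the dark class
        mean_lo = sum_lo / w_lo
        mean_hi = (sum_all - sum_lo) / w_hi
        var_between = w_lo * w_hi * (mean_lo - mean_hi) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(gray, threshold):
    """Map pixels above the threshold to white (255) and the rest to black (0)."""
    return [[255 if v > threshold else 0 for v in row] for row in gray]


# Tiny hypothetical "scan": light grayish paper with two dark ink pixels.
page = [
    [200, 210, 205, 40],
    [198, 35, 207, 202],
]
t = otsu_threshold(page)
clean = binarize(page, t)
```

On a real scan you would load the pixels with an imaging library and convert to grayscale first; the threshold logic itself stays the same.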
Answer 2:
Maybe the algorithm is, approximately:
- Decide what the background color is
- Scan the bitmap for pixels whose color is (and/or is sufficiently similar to) the background color
- Convert these pixels to white or transparent
- Possibly (especially if the page contains images and not just text) leave isolated pixels alone, i.e. pixels that match the background color but are not adjacent to any other background pixels
If it's a high-resolution, low-color-depth (e.g. black-and-white) image, where the background tint is made of dithered dots rather than a solid color, then you need to apply this algorithm to groups of pixels instead of single pixels.
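The steps above can be sketched roughly as follows in plain Python. This is only an illustration under several assumptions, not ScanTailor's actual code: the image is a list of rows of `(r, g, b)` tuples, the background is guessed as the most common coarsely quantised color, and the names `estimate_background`, `remove_background`, `tolerance` and `min_neighbours` are hypothetical.

```python
from collections import Counter

WHITE = (255, 255, 255)


def estimate_background(pixels, step=16):
    """Guess the background: average of the most populated color bucket."""
    counts = Counter()
    sums = {}
    for row in pixels:
        for r, g, b in row:
            key = (r // step, g // step, b // step)  # coarse quantisation
            counts[key] += 1
            s = sums.setdefault(key, [0, 0, 0])
            s[0] += r
            s[1] += g
            s[2] += b
    key, n = counts.most_common(1)[0]
    return tuple(c // n for c in sums[key])


def is_similar(a, b, tolerance):
    """True if every channel differs by at most `tolerance`."""
    return all(abs(x - y) <= tolerance for x, y in zip(a, b))


def remove_background(pixels, tolerance=30, min_neighbours=1):
    """Turn background-like pixels white, leaving isolated matches untouched."""
    bg = estimate_background(pixels)
    h, w = len(pixels), len(pixels[0])
    mask = [[is_similar(p, bg, tolerance) for p in row] for row in pixels]

    out = []
    for y in range(h):
        new_row = []
        for x in range(w):
            keep = True
            if mask[y][x]:
                # Count background-like pixels in the 3x3 neighbourhood,
                # excluding the pixel itself.
                n = sum(
                    mask[j][i]
                    for j in range(max(0, y - 1), min(h, y + 2))
                    for i in range(max(0, x - 1), min(w, x + 2))
                ) - 1
                keep = n < min_neighbours
            new_row.append(pixels[y][x] if keep else WHITE)
        out.append(new_row)
    return out


# Tiny hypothetical page: cream-colored paper with one dark ink pixel.
paper, ink = (235, 225, 200), (20, 20, 25)
page = [
    [paper, paper, paper],
    [paper, ink, paper],
    [paper, paper, (238, 228, 198)],
]
cleaned = remove_background(page)
```

For the dithered black-and-white case, one option is to average each small block of pixels first and run the same test on the block averages.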
Source: https://stackoverflow.com/questions/4327172/separation-of-background-foreground-layers-in-a-scanned-document