Extract text with strikethrough from image

Question

Here's an example image ->

I would like to extract the text that is styled with a strikethrough. For the image above, that would be: de location

How would I do this?

Here's what I have so far, using OpenCV and Python:

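import cv2
import numpy as np
import matplotlib.pyplot as plt

im = cv2.imread('image.png')  # replace with the path to your image
# Closing with a wide kernel (1 row x 44 columns) wipes out the dark text
# and leaves only the long horizontal dark lines.
kernel = np.ones((1, 44), np.uint8)
morphed = cv2.morphologyEx(im, cv2.MORPH_CLOSE, kernel)
plt.imshow(morphed)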

This gives me the horizontal lines ->

I am new to image processing and am having a hard time isolating only the text that is struck through.

Bonus: along with the strikethrough text, I would also like to extract the neighboring text, so that I can correctly mark up the strikethrough text in context with the rest of the text.

UPDATE 1: Based on the first answer, I did the following:

import cv2
import matplotlib.pyplot as plt

# Load image, convert to grayscale, Otsu's threshold
image = cv2.imread('image.png')
result = image.copy()
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]

# Detect horizontal lines
horizontal_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (40, 1))
detect_horizontal = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, horizontal_kernel, iterations=10)
cnts = cv2.findContours(detect_horizontal, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
# findContours returns 2 values in OpenCV 4.x and 3 values in 3.x
cnts = cnts[0] if len(cnts) == 2 else cnts[1]

# Draw the detected line contours on a copy of the input
for c in cnts:
    cv2.drawContours(result, [c], -1, (36, 255, 12), 2)
plt.imshow(result)

This produced the following image:

I tried playing with the values for the horizontal kernel but had no luck.

UPDATE 2: I modified the above snippet further and got this:

import cv2
import numpy as np
import matplotlib.pyplot as plt

# Load image, convert to grayscale, Otsu's threshold
image = cv2.imread('image.png')
result = image.copy()
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]

# Thicken the strokes slightly before looking for lines
kernel = np.ones((4, 2), np.uint8)
erosion = cv2.erode(thresh, kernel, iterations=1)
dilation = cv2.dilate(thresh, kernel, iterations=1)

trans = dilation
# plt.imshow(erosion)

# Detect horizontal lines
horizontal_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (8, 1))
detect_horizontal = cv2.morphologyEx(trans, cv2.MORPH_OPEN, horizontal_kernel, iterations=10)
cnts = cv2.findContours(detect_horizontal, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
cnts = cnts[0] if len(cnts) == 2 else cnts[1]
for c in cnts:
    cv2.drawContours(result, [c], -1, (36, 255, 12), 2)
plt.imshow(result)

This produced the following image, and the same approach works on my other image types as well:

It is not 100% accurate (it missed the struck-through "de"), but I like the results so far.

Now, I am struggling with how to check if the neighboring pixels are black or white to isolate the strikethrough.

Answer 1:

One way to achieve this:

1. Binarise the image (https://docs.opencv.org/master/d7/d4d/tutorial_py_thresholding.html).
2. Find the horizontal lines (see "Horizontal Line detection with OpenCV").
3. For each line, check whether the pixels directly above and below it are white (background) or not.
4. If there are non-white pixels both above and below the line, that line is a strikethrough.
5. Run connected-component labeling on the image (see "connected component labeling in python").
6. Look up the labels under the lines detected earlier and mask those labels to get the strikethrough text.

Answer 2:

You can use a property of the strikethrough, such as its thickness: the strikethrough line is thinner than the underline. The thin lines can be selected with morphological operations, and the connected components they belong to can then be restored by morphological reconstruction.

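import cv2

# Load as grayscale and binarise so that ink is white (255) on black.
# Note: a threshold of 0 treats only pure-black pixels as ink; add
# cv2.THRESH_OTSU to the flags if the image is anti-aliased.
img = cv2.imread('juFpe.png', cv2.IMREAD_GRAYSCALE)
thresh = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV)[1]

kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, 5))   # 1 px wide, 5 px tall
kernel2 = cv2.getStructuringElement(cv2.MORPH_RECT, (8, 8))

# Opening with the tall kernel removes strokes thinner than 5 px vertically;
# the dilation then pads whatever survived.
detect_thin = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, kernel)
detect_thin = cv2.morphologyEx(detect_thin, cv2.MORPH_DILATE, kernel2)

# Marker: ink pixels not covered by the padded thick strokes, i.e. thin lines
marker = cv2.compare(detect_thin, thresh, cv2.CMP_LT)

# Morphological reconstruction: grow the marker inside the ink until it is stable
while True:
    tmp = marker.copy()
    marker = cv2.dilate(marker, kernel2)
    marker = cv2.min(thresh, marker)
    difference = cv2.subtract(marker, tmp)
    if cv2.countNonZero(difference) == 0:
        break

cv2.imwrite('lines.png', marker)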

Result:

Source: https://stackoverflow.com/questions/62669589/extract-text-with-strikethrough-from-image

Tags: python, OpenCV, image-processing, underline, strikethrough