Improve horizontal line detection in .pdf image with OpenCV

问题

I have .pdf files that have been converted to .jpg images for this project. My goal is to identify the blanks (e.g ____________) that you would generally find in a .pdf form that indicate a space for the user to sign of fill out some kind of information. I have been using edge detection with the cv2.Canny() and cv2.HoughlinesP() functions.

This works fairly well, but there are quite a few false positives that come about from seemingly nowhere. When I look at the 'edges' file it shows a bunch of noise around the other words. I'm uncertain where this noise comes from.

Should I continue to tweak the parameters, or is there a better method to find the location of these blanks?

回答1:

Assuming that you're trying to find horizontal lines on a .pdf form, here's a simple approach:

Convert image to grayscale and adaptive threshold image
Construct special kernel to detect only horizontal lines
Perform morphological transformations
Find contours and draw onto image

Using this example image

Convert to grayscale and adaptive threshold to obtain a binary image

gray = cv2.cvtColor(image,cv2.COLOR_BGR2GRAY)
thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]

Then we create a kernel with cv2.getStructuringElement() and perform morphological transformations to isolate horizontal lines

horizontal_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (15,1))
detected_lines = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, horizontal_kernel, iterations=2)

From here we can use cv2.HoughLinesP() to detect lines but since we have already preprocessed the image and isolated the horizontal lines, we can just find contours and draw the result

cnts = cv2.findContours(detected_lines, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
cnts = cnts[0] if len(cnts) == 2 else cnts[1]

for c in cnts:
    cv2.drawContours(image, [c], -1, (36,255,12), 3)

Full code

import cv2

image = cv2.imread('2.png')
gray = cv2.cvtColor(image,cv2.COLOR_BGR2GRAY)
thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]

horizontal_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (15,1))
detected_lines = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, horizontal_kernel, iterations=2)

cnts = cv2.findContours(detected_lines, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
cnts = cnts[0] if len(cnts) == 2 else cnts[1]

for c in cnts:
    cv2.drawContours(image, [c], -1, (36,255,12), 3)

cv2.imshow('thresh', thresh)
cv2.imshow('detected_lines', detected_lines)
cv2.imshow('image', image)
cv2.waitKey()

来源：https://stackoverflow.com/questions/57260893/improve-horizontal-line-detection-in-pdf-image-with-opencv

标签

python

OpenCV

image-processing

computer-vision

tesseract