Pytesseract random bug when reading text

女生的网名这么多〃 submitted on 2019-12-06 13:17:06

As the comment said, the problem is your text and background colors. Tesseract is basically useless with light text on a dark background. Here are the few lines I apply to any text image before giving it to Tesseract:

import cv2
import numpy

# convert the color image to grayscale (your_image is a BGR array, e.g. from cv2.imread)
grayscale_image = cv2.cvtColor(your_image, cv2.COLOR_BGR2GRAY)

# Otsu's thresholding method picks a threshold automatically and returns an image with only black and white pixels
_, binary_image = cv2.threshold(grayscale_image, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# we don't know yet whether the text is black on a white background or vice versa,
# so we count how many black pixels and white pixels there are
count_white = numpy.sum(binary_image > 0)
count_black = numpy.sum(binary_image == 0)

# if there are more black pixels than white ones, the background is black, so we invert the image
if count_black > count_white:
    binary_image = 255 - binary_image

black_text_white_background_image = binary_image

Now you're sure to have black text on a white background no matter which colors the original image had (assuming the background covers more of the image than the text does). Also, Tesseract is (weirdly) most accurate when the characters are about 35 pixels high. Larger characters don't significantly reduce the accuracy, but characters just a few pixels shorter can make Tesseract useless!
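To act on the 35-pixel rule of thumb, one option is to measure how tall the text currently is and work out a resize factor from that. The sketch below is only an illustration: the helper names text_height and scale_for_target are hypothetical, and it assumes a single line of black text on a white background, as produced by the code above.

```python
import numpy as np

TARGET_HEIGHT = 35  # pixel height suggested in the answer; a rule of thumb, not a Tesseract setting


def text_height(binary_image):
    """Height in pixels of the band of rows that contain black (0) pixels.

    Assumes a single line of black text on a white (255) background.
    """
    rows_with_text = np.flatnonzero((binary_image == 0).any(axis=1))
    if rows_with_text.size == 0:
        return 0
    return int(rows_with_text[-1] - rows_with_text[0] + 1)


def scale_for_target(binary_image, target=TARGET_HEIGHT):
    """Scale factor that would bring the text band to `target` pixels."""
    height = text_height(binary_image)
    if height == 0:
        return 1.0  # nothing to measure, leave the image unchanged
    return target / height


# Synthetic example: white 40x100 image with a black band on rows 10..23 (14 px tall)
demo = np.full((40, 100), 255, dtype=np.uint8)
demo[10:24, 20:80] = 0
scale = scale_for_target(demo)
print(text_height(demo), scale)  # 14 2.5
```

The resulting factor could then be passed to cv2.resize(image, None, fx=scale, fy=scale, interpolation=cv2.INTER_CUBIC) before running OCR. With several lines of text the measured band spans all of them, so the height would have to be estimated per line instead.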

Preprocessing is an important step before passing the image to Pytesseract. Generally, you want the desired text in black and the background in white. Currently, your foreground text is in green instead of white. Here's a simple process to fix the format:

  • Convert image to grayscale
  • Otsu's threshold to obtain a binary image
  • Invert image
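To make the thresholding step less of a black box, here is a rough NumPy sketch of the idea behind Otsu's method: try every gray level as a threshold and keep the one that best separates the two pixel classes (maximum between-class variance). This is only an illustration, not OpenCV's implementation; otsu_threshold is a hypothetical helper, and in practice cv2.threshold with cv2.THRESH_OTSU does this for you.

```python
import numpy as np


def otsu_threshold(gray):
    """Return the gray level t that maximizes the between-class variance.

    `gray` is a 2-D uint8 array. Pixels <= t form one class, pixels > t the other.
    """
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    levels = np.arange(256)

    weight_low = np.cumsum(hist)            # pixel count at or below each level
    weight_high = hist.sum() - weight_low   # pixel count above each level
    sum_low = np.cumsum(hist * levels)      # intensity sum at or below each level
    sum_all = sum_low[-1]

    # Only thresholds that leave pixels on both sides are meaningful
    valid = (weight_low > 0) & (weight_high > 0)
    mean_low = np.zeros(256)
    mean_high = np.zeros(256)
    mean_low[valid] = sum_low[valid] / weight_low[valid]
    mean_high[valid] = (sum_all - sum_low[valid]) / weight_high[valid]

    between = weight_low * weight_high * (mean_low - mean_high) ** 2
    between[~valid] = -1.0
    return int(np.argmax(between))


# Synthetic grayscale image: dark background (40) with a light "text" patch (200)
gray = np.full((4, 8), 40, dtype=np.uint8)
gray[1:3, 2:6] = 200

t = otsu_threshold(gray)
binary = np.where(gray > t, 255, 0).astype(np.uint8)  # light text becomes white
inverted = 255 - binary                               # text black, background white
print(t)
print(inverted)
```

After the inversion, the patch that was light in the input is 0 (black) and the background is 255 (white), which is the format described above.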

Original image

Otsu's threshold

Invert image

Output from Pytesseract

122 Vitalité

Other image

200 Vitalité

Before inverting the image, it may be a good idea to perform morphological operations to smooth/filter the text. But for your images, the text does not necessarily require additional smoothing.

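import cv2
import pytesseract

# Path to the Tesseract executable (Windows install location; adjust for your system)
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

# Load the image directly as grayscale
image = cv2.imread('3.png', cv2.IMREAD_GRAYSCALE)
# Otsu's threshold gives a binary image in which the light text becomes white
thresh = cv2.threshold(image, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
# Invert so the text is black on a white background
result = 255 - thresh

# --psm 6 tells Tesseract to assume a single uniform block of text
data = pytesseract.image_to_string(result, lang='eng', config='--psm 6')
print(data)

# Show the thresholded and inverted images until a key is pressed
cv2.imshow('thresh', thresh)
cv2.imshow('result', result)
cv2.waitKey()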
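For images that do need the morphological smoothing mentioned above, a common choice is an "opening" (erosion followed by dilation), which removes small specks while keeping larger strokes. The sketch below shows the idea in plain NumPy with a 3x3 neighbourhood; the helper names are hypothetical, and border handling may differ from OpenCV's. It assumes the text is the white (True) foreground, i.e. it would be applied to the thresholded image before inverting.

```python
import numpy as np


def _neighbourhood_stack(mask):
    """Stack the nine 3x3-neighbourhood shifts of a boolean mask (outside counts as False)."""
    padded = np.pad(mask, 1, constant_values=False)
    h, w = mask.shape
    return np.stack([padded[dy:dy + h, dx:dx + w] for dy in range(3) for dx in range(3)])


def erode(mask):
    # A pixel survives only if its whole 3x3 neighbourhood is set
    return _neighbourhood_stack(mask).all(axis=0)


def dilate(mask):
    # A pixel is set if any pixel in its 3x3 neighbourhood is set
    return _neighbourhood_stack(mask).any(axis=0)


def opening(mask):
    # Erosion removes small specks, dilation grows the surviving shapes back
    return dilate(erode(mask))


# Synthetic foreground mask: a 4x4 "stroke" plus one isolated noise pixel
mask = np.zeros((7, 7), dtype=bool)
mask[1:5, 1:5] = True
mask[6, 6] = True

opened = opening(mask)
print(opened.astype(np.uint8))
```

In this example the isolated pixel disappears and the 4x4 block is restored. With OpenCV, the corresponding call would be cv2.morphologyEx(thresh, cv2.MORPH_OPEN, kernel), with kernel built by cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3)).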