Question
I am trying to write a program that extracts the text from a screenshot using Tesseract and Python. I have no trouble getting one part of it, but some of the text is lighter colored and Tesseract does not pick it up. Below is an example of a picture I am using:
I am able to get the text at the top of the picture, but not the 3 options below it.
Here is the code I am using to grab the text:
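import pytesseract

# `screen` (the screenshot image) and `stop` (the stop words) are defined elsewhere
result = pytesseract.image_to_string(
    screen, config="load_system_dawg=0 load_freq_dawg=0")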
print("below is the total value scraped by the tesseract")
print(result)
# Split up newlines until we have our question and answers
parts = result.split("\n\n")
question = parts.pop(0).replace("\n", " ")
q_terms = question.split(" ")
q_terms = list(filter(lambda t: t not in stop, q_terms))
q_terms = set(q_terms)
parts = "\n".join(parts)
parts = parts.split("\n")
answers = list(filter(lambda p: len(p) > 0, parts))
When I have plain black text without a colored background, the answers array is populated with the 3 options below, but not in this case. Is there any way I can fix this?
Answer 1:
You're missing a binarization (thresholding) step.
In your case you can simply apply a binary threshold to the grayscale image.
Here is the result image with threshold = 177:
You can learn more about thresholding in the OpenCV Python library documentation.
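As a rough sketch of the thresholding step described above, something like the following could work. The function names, the `image_path` parameter, and the choice to feed the binarized array straight into `pytesseract.image_to_string` are assumptions for illustration, not part of the original answer. Only the threshold value of 177 comes from the answer. The pure-Python helper just spells out the rule that `cv2.THRESH_BINARY` applies to each pixel.

```python
def binary_threshold(gray, thresh=177, maxval=255):
    """Pure-Python illustration of a binary threshold on a grayscale image.

    `gray` is a list of rows of 0-255 intensities. Pixels strictly above
    `thresh` become `maxval`, all others become 0, which is the same rule
    that OpenCV's cv2.THRESH_BINARY applies.
    """
    return [[maxval if px > thresh else 0 for px in row] for row in gray]


def ocr_with_threshold(image_path, thresh=177):
    """Binarize a screenshot with OpenCV, then pass it to Tesseract.

    `image_path` is a hypothetical path to the screenshot on disk.
    """
    # Imported here so binary_threshold() can be tried without these packages.
    import cv2
    import pytesseract

    image = cv2.imread(image_path)
    if image is None:
        raise FileNotFoundError(image_path)

    # Convert to grayscale, then apply the binary threshold.
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(gray, thresh, 255, cv2.THRESH_BINARY)

    return pytesseract.image_to_string(binary)
```

If the light text ends up white on a black background and recognition is still poor, swapping in `cv2.THRESH_BINARY_INV` to get dark text on a light background may be worth trying. The best threshold value will likely depend on the screenshot.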
Source: https://stackoverflow.com/questions/48530331/tesseract-not-picking-up-different-colored-text