Why does pytesseract raise an error with the Arabic language?

Question

I want to use pytesseract with Arabic. I have ara.traineddata on my system under /usr/share/tesseract/tessdata/, and I have already installed the tesseract package.

This is my code:

 import pytesseract
 from PIL import Image
 pytesseract.image_to_string(Image.open('test_arabic.png'), config='', lang="ara")

And I get this error:

TesseractError                            Traceback (most recent call last)

----> 1 pytesseract.image_to_string(Image.open('test_persian.png'), config='', lang="ara")

~/.local/lib/python3.8/site-packages/pytesseract/pytesseract.py in image_to_string(image, lang, config, nice, output_type, timeout)
    368     args = [image, 'txt', lang, config, nice, timeout]
    369 
--> 370     return {
    371         Output.BYTES: lambda: run_and_get_output(*(args + [True])),
    372         Output.DICT: lambda: {'text': run_and_get_output(*args)},

~/.local/lib/python3.8/site-packages/pytesseract/pytesseract.py in <lambda>()
    371         Output.BYTES: lambda: run_and_get_output(*(args + [True])),
    372         Output.DICT: lambda: {'text': run_and_get_output(*args)},
--> 373         Output.STRING: lambda: run_and_get_output(*args),
    374     }[output_type]()
    375 

~/.local/lib/python3.8/site-packages/pytesseract/pytesseract.py in run_and_get_output(image, extension, lang, config, nice, timeout, return_bytes)
    280         }
    281 
--> 282         run_tesseract(**kwargs)
    283         filename = kwargs['output_filename_base'] + extsep + extension
    284         with open(filename, 'rb') as output_file:

~/.local/lib/python3.8/site-packages/pytesseract/pytesseract.py in run_tesseract(input_filename, output_filename_base, extension, lang, config, nice, timeout)
    256     with timeout_manager(proc, timeout) as error_string:
    257         if proc.returncode:
--> 258             raise TesseractError(proc.returncode, get_errors(error_string))
    259 
    260 

TesseractError: (1, 'read_params_file: parameter not found:')

Thanks for help.

Answer 1:

I suggest using the proper language model and the latest version:

For Windows 10:

Install tesseract-ocr-w64-setup-v5.0.0-alpha.20200328.exe (64 bit).

To validate the installation, run the following in PowerShell or cmd:

tesseract -v

It will output something like this: tesseract v5.0.0-alpha.20200328

For macOS:

brew install tesseract

To validate the installation, run the following in the terminal:

tesseract -v

It will output something like this: tesseract 4.1.1, followed by the installed image libraries (leptonica-1.80.0, libgif 5.2.1 : libjpeg 9d : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 1.1.0 : libopenjp2 2.3.1) and the detected CPU features (Found AVX2, Found AVX, Found FMA, Found SSE).
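The same check can be done from Python before calling pytesseract. This is a minimal sketch, not part of the original answer: the helper names `parse_tesseract_version` and `installed_tesseract_version` are made up for illustration, and the parser only assumes that the first line of the output looks like the two samples shown above.

```python
import re
import shutil
import subprocess


def parse_tesseract_version(output):
    """Return the version token from the first line of `tesseract -v` output, or None."""
    text = output.strip()
    first_line = text.splitlines()[0] if text else ""
    # Accept both "tesseract 4.1.1" and "tesseract v5.0.0-alpha.20200328".
    match = re.match(r"tesseract\s+v?(\S+)", first_line)
    return match.group(1) if match else None


def installed_tesseract_version():
    """Run `tesseract -v` if the binary is on PATH; return None if it is not found."""
    exe = shutil.which("tesseract")
    if exe is None:
        return None
    result = subprocess.run([exe, "-v"], capture_output=True, text=True)
    # Some releases write the version to stderr instead of stdout, so check both.
    return parse_tesseract_version(result.stdout or result.stderr)
```

Calling `installed_tesseract_version()` returns `None` when no `tesseract` executable is on the PATH, which is a quick way to tell a missing installation apart from a language-data problem.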

If you are not sure about the path, simply copy the ara.traineddata file into the same folder as your Python .py file:

import pytesseract
from PIL import Image
import os
os.environ["TESSDATA_PREFIX"] = ""  # Left empty because ara.traineddata has been copied into the current directory
print(os.getenv("TESSDATA_PREFIX"))
# ara.traineddata must be in the same directory as this Python script
print(pytesseract.image_to_string(Image.open('cropped.png'), lang="ara"))
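Before copying files around, it may help to confirm where ara.traineddata actually lives. The following is a rough sketch using only the standard library. The function names are hypothetical, and the directory list in `default_candidates` is an assumption based on the paths mentioned in this thread, not Tesseract's own search order.

```python
import os
from pathlib import Path


def find_traineddata(lang, candidate_dirs):
    """Return the first directory containing <lang>.traineddata, or None if none does."""
    for directory in candidate_dirs:
        if (Path(directory) / f"{lang}.traineddata").is_file():
            return Path(directory)
    return None


def default_candidates():
    """Directories worth checking (an assumed list, not Tesseract's lookup logic)."""
    candidates = [Path.cwd(), Path("/usr/share/tesseract/tessdata")]
    prefix = os.environ.get("TESSDATA_PREFIX")
    if prefix:
        # Check both the prefix itself and a tessdata subfolder beneath it.
        candidates = [Path(prefix), Path(prefix) / "tessdata"] + candidates
    return candidates
```

For example, `find_traineddata("ara", default_candidates())` returns the matching directory, or `None` if the file is in none of the listed locations.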

For Linux/Ubuntu OS:

sudo apt-get install tesseract-ocr

The validation and run code are the same as for macOS.

Also make sure the tessdata path is correct.

This code works if the ara.traineddata file has been downloaded successfully:

import pytesseract
from PIL import Image
print(pytesseract.image_to_string(Image.open('cropped.png'), lang="ara"))
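If the default lookup path turns out to be the problem, another option is to hand the directory to Tesseract explicitly through its `--tessdata-dir` command-line option, passed via pytesseract's `config` argument. This is a sketch under that assumption: `tessdata_config` and `ocr_arabic` are illustrative names, and the directory you pass must be the one that actually holds ara.traineddata.

```python
def tessdata_config(tessdata_dir):
    """Build a config string that points Tesseract at an explicit tessdata directory."""
    # The double quotes keep paths containing spaces intact.
    return f'--tessdata-dir "{tessdata_dir}"'


def ocr_arabic(image_path, tessdata_dir):
    """Run Arabic OCR on image_path using language data from tessdata_dir."""
    # Imported here so the helper above can be used without pytesseract installed.
    import pytesseract
    from PIL import Image

    return pytesseract.image_to_string(
        Image.open(image_path),
        lang="ara",
        config=tessdata_config(tessdata_dir),
    )
```

With the path from the question, the call would look like `ocr_arabic('test_arabic.png', '/usr/share/tesseract/tessdata')`.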

You can follow this tutorial for details. Here is the demo output of this tutorial which uses Arabic language as well.

Source: https://stackoverflow.com/questions/64244290/why-pytesseract-raise-an-error-with-arabic-language

Tags

python

image-processing

ocr

tesseract

python-tesseract