Why does pytesseract raise an error with the Arabic language?

Posted by 半世苍凉 on 2021-01-29 07:23:58

Question


I want to use pytesseract with Arabic. I have ara.traineddata on my system under /usr/share/tesseract/tessdata/, and I have already installed the tesseract package.

This is my code:

 import pytesseract
 from PIL import Image
 pytesseract.image_to_string(Image.open('test_arabic.png'), config='', lang="ara")

And I get this error:

TesseractError                            Traceback (most recent call last)

in

----> 1 pytesseract.image_to_string(Image.open('test_persian.png'), config='', lang="ara")

~/.local/lib/python3.8/site-packages/pytesseract/pytesseract.py in image_to_string(image, lang, config, nice, output_type, timeout)
    368     args = [image, 'txt', lang, config, nice, timeout]
    369 
--> 370     return {
    371         Output.BYTES: lambda: run_and_get_output(*(args + [True])),
    372         Output.DICT: lambda: {'text': run_and_get_output(*args)},

~/.local/lib/python3.8/site-packages/pytesseract/pytesseract.py in <lambda>()
    371         Output.BYTES: lambda: run_and_get_output(*(args + [True])),
    372         Output.DICT: lambda: {'text': run_and_get_output(*args)},
--> 373         Output.STRING: lambda: run_and_get_output(*args),
    374     }[output_type]()
    375 

~/.local/lib/python3.8/site-packages/pytesseract/pytesseract.py in run_and_get_output(image, extension, lang, config, nice, timeout, return_bytes)
    280         }
    281 
--> 282         run_tesseract(**kwargs)
    283         filename = kwargs['output_filename_base'] + extsep + extension
    284         with open(filename, 'rb') as output_file:

~/.local/lib/python3.8/site-packages/pytesseract/pytesseract.py in run_tesseract(input_filename, output_filename_base, extension, lang, config, nice, timeout)
    256     with timeout_manager(proc, timeout) as error_string:
    257         if proc.returncode:
--> 258             raise TesseractError(proc.returncode, get_errors(error_string))
    259 
    260 

TesseractError: (1, 'read_params_file: parameter not found:')

Thanks for your help.


Answer 1:


I suggest using the proper language model and the latest version of Tesseract:

For Windows 10:

tesseract-ocr-w64-setup-v5.0.0-alpha.20200328.exe (64 bit)

To validate the installation, run the following in PowerShell or a cmd terminal:

tesseract -v

It will output something like this: tesseract v5.0.0-alpha.20200328

For macOS:

brew install tesseract

To validate the installation, run the following in the terminal:

tesseract -v

It will output something like this: tesseract 4.1.1, followed by the linked image libraries (leptonica-1.80.0, libgif 5.2.1 : libjpeg 9d : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 1.1.0 : libopenjp2 2.3.1) and the detected CPU features (Found AVX2, Found AVX, Found FMA, Found SSE).
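The same check can be scripted from Python. The sketch below is illustrative and not part of the original answer: the helper names parse_tesseract_version and installed_tesseract_version are hypothetical, and the parser assumes the first output line looks like the two examples above (tesseract 4.1.1 or tesseract v5.0.0-alpha.20200328).

```python
import re
import shutil
import subprocess


def parse_tesseract_version(output):
    """Return (major, minor, patch) parsed from the first line of the version banner, or None."""
    stripped = output.strip()
    first_line = stripped.splitlines()[0] if stripped else ""
    # Accept both "tesseract 4.1.1" and "tesseract v5.0.0-alpha.20200328".
    match = re.match(r"tesseract\s+v?(\d+)\.(\d+)\.(\d+)", first_line)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def installed_tesseract_version():
    """Run the tesseract binary found on PATH and parse its version; None if it is missing."""
    if shutil.which("tesseract") is None:
        return None
    result = subprocess.run(
        ["tesseract", "--version"], capture_output=True, text=True
    )
    # Assumption: some builds print the banner to stderr, so fall back to it.
    return parse_tesseract_version(result.stdout or result.stderr)


if __name__ == "__main__":
    print(installed_tesseract_version())
```

A result of None means either that no tesseract executable is on PATH or that the banner did not match the assumed format.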

If you are not sure about the path, simply copy the ara.traineddata file into the same folder as your Python .py file:

import pytesseract
from PIL import Image
import os
os.environ["TESSDATA_PREFIX"] = ""  # Left empty because ara.traineddata has been copied into the current directory
print(os.getenv("TESSDATA_PREFIX"))
# ara.traineddata must be in the same directory as this Python script
print(pytesseract.image_to_string(Image.open('cropped.png'), lang="ara"))

For Linux/Ubuntu OS:

sudo apt-get install tesseract-ocr

The validation command and the Python code are the same as for macOS.

Also make sure the path to the tessdata folder is correct.
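One way to check the path is to look for ara.traineddata in the directories Tesseract is likely to read. This is only a sketch under stated assumptions: the helper find_traineddata and the candidate list are hypothetical, /usr/share/tesseract/tessdata is the path from the question, and TESSDATA_PREFIX is assumed to point at the directory that holds the .traineddata files.

```python
import os
from pathlib import Path


def find_traineddata(lang, candidate_dirs):
    """Return the first existing <dir>/<lang>.traineddata as a Path, or None."""
    for directory in candidate_dirs:
        if not directory:
            continue  # skip None or empty entries, e.g. an unset TESSDATA_PREFIX
        candidate = Path(directory) / f"{lang}.traineddata"
        if candidate.is_file():
            return candidate
    return None


if __name__ == "__main__":
    candidates = [
        os.environ.get("TESSDATA_PREFIX"),  # explicit override, if set
        "/usr/share/tesseract/tessdata",    # path mentioned in the question
        os.getcwd(),                        # current working directory
    ]
    found = find_traineddata("ara", candidates)
    print(found if found else "ara.traineddata not found in the checked directories")
```

If nothing is found, the language file is probably missing or stored somewhere the list does not cover.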

This code works fine if the ara.traineddata file has been downloaded successfully (on Ubuntu it is packaged separately as tesseract-ocr-ara):

import pytesseract
from PIL import Image
print(pytesseract.image_to_string(Image.open('cropped.png'), lang="ara"))
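If the traineddata file lives in a directory Tesseract does not read by default, the config argument of pytesseract can forward Tesseract's --tessdata-dir option instead of relying on TESSDATA_PREFIX. A minimal sketch, assuming the directory from the question; the helper names tessdata_config and ocr_arabic are hypothetical, and pytesseract and Pillow are imported inside the function so the config helper works without them.

```python
def tessdata_config(tessdata_dir):
    """Build a pytesseract config string that points Tesseract at tessdata_dir."""
    # Double quotes keep paths that contain spaces in one piece.
    return f'--tessdata-dir "{tessdata_dir}"'


def ocr_arabic(image_path, tessdata_dir):
    """Run Arabic OCR on image_path using the traineddata stored in tessdata_dir."""
    import pytesseract
    from PIL import Image

    return pytesseract.image_to_string(
        Image.open(image_path),
        lang="ara",
        config=tessdata_config(tessdata_dir),
    )


# Example call (directory taken from the question; adjust it to your installation):
# print(ocr_arabic("cropped.png", "/usr/share/tesseract/tessdata"))
```

Passing the directory explicitly makes it obvious which ara.traineddata file is being loaded, which helps when several Tesseract installations coexist.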

You can follow this tutorial for details. Here is the demo output of this tutorial, which also uses the Arabic language.



Source: https://stackoverflow.com/questions/64244290/why-pytesseract-raise-an-error-with-arabic-language
