Question
I want to use pytesseract with Arabic. I have ara.traineddata in /usr/share/tesseract/tessdata/ on my system, and I have already installed the tesseract package.
This is my code:
import pytesseract
from PIL import Image
pytesseract.image_to_string(Image.open('test_arabic.png'), config='', lang="ara")
And I get this error:
TesseractError Traceback (most recent call last)
in
----> 1 pytesseract.image_to_string(Image.open('test_persian.png'), config='', lang="ara")
~/.local/lib/python3.8/site-packages/pytesseract/pytesseract.py in image_to_string(image, lang, config, nice, output_type, timeout)
368 args = [image, 'txt', lang, config, nice, timeout]
369
--> 370 return {
371 Output.BYTES: lambda: run_and_get_output(*(args + [True])),
372 Output.DICT: lambda: {'text': run_and_get_output(*args)},
~/.local/lib/python3.8/site-packages/pytesseract/pytesseract.py in <lambda>()
371 Output.BYTES: lambda: run_and_get_output(*(args + [True])),
372 Output.DICT: lambda: {'text': run_and_get_output(*args)},
--> 373 Output.STRING: lambda: run_and_get_output(*args),
374 }[output_type]()
375
~/.local/lib/python3.8/site-packages/pytesseract/pytesseract.py in run_and_get_output(image, extension, lang, config, nice, timeout, return_bytes)
280 }
281
--> 282 run_tesseract(**kwargs)
283 filename = kwargs['output_filename_base'] + extsep + extension
284 with open(filename, 'rb') as output_file:
~/.local/lib/python3.8/site-packages/pytesseract/pytesseract.py in run_tesseract(input_filename, output_filename_base, extension, lang, config, nice, timeout)
256 with timeout_manager(proc, timeout) as error_string:
257 if proc.returncode:
--> 258 raise TesseractError(proc.returncode, get_errors(error_string))
259
260
TesseractError: (1, 'read_params_file: parameter not found:')
Thanks for your help.
Answer 1:
I suggest using the proper language model and the latest version:
For Windows 10:
Install tesseract-ocr-w64-setup-v5.0.0-alpha.20200328.exe (64 bit).
To validate the installation, execute the following in PowerShell or a cmd terminal:
tesseract -v
It will output something like this: tesseract v5.0.0-alpha.20200328
For macOS:
brew install tesseract
To validate the installation, execute the following in a terminal:
tesseract -v
It will output something like this, listing the version and the installed image libraries:
tesseract 4.1.1
 leptonica-1.80.0
  libgif 5.2.1 : libjpeg 9d : libpng 1.6.37 : libtiff 4.1.0 : zlib 1.2.11 : libwebp 1.1.0 : libopenjp2 2.3.1
 Found AVX2
 Found AVX
 Found FMA
 Found SSE
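The same check can be done from Python, which helps when the script runs in an environment whose PATH differs from your interactive shell. This is only a minimal sketch: the helper names parse_version and tesseract_version are hypothetical, and it assumes the first line of the output has the form "tesseract <version>" as in the examples above.

```python
import shutil
import subprocess


def parse_version(output):
    """Return the version token from the first line of `tesseract -v` output."""
    first_line = output.strip().splitlines()[0]
    # The first line looks like "tesseract 4.1.1" or "tesseract v5.0.0-alpha.20200328".
    return first_line.split()[1]


def tesseract_version(tesseract_cmd="tesseract"):
    """Return the installed version string, or None if the binary is not on PATH."""
    if shutil.which(tesseract_cmd) is None:
        return None
    result = subprocess.run(
        [tesseract_cmd, "-v"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    # Some releases print the version banner to stderr rather than stdout.
    return parse_version(result.stdout or result.stderr)
```

Calling print(tesseract_version()) prints None when no tesseract binary can be found, which usually means the install or PATH needs fixing before looking at the language data.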
If you are not sure about the path, simply copy the ara.traineddata file into the same folder as your Python .py file:
import pytesseract
from PIL import Image
import os
os.environ["TESSDATA_PREFIX"] = ""  # Left empty because the file has already been copied into the current directory
print(os.getenv("TESSDATA_PREFIX"))
# Copy the ara.traineddata file into the same directory as this Python script
print(pytesseract.image_to_string(Image.open('cropped.png'), lang="ara"))
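Instead of copying the file, another option is to point Tesseract at the directory that already holds ara.traineddata, using the --tessdata-dir command-line option passed through pytesseract's config argument. The following is a sketch under assumptions, not a confirmed fix for the error in the question: the helper names find_tessdata_dir, tessdata_config and ocr_with_tessdata are hypothetical, and the candidate directories are whatever you pass in.

```python
import os


def find_tessdata_dir(lang, candidate_dirs):
    """Return the first directory containing <lang>.traineddata, or None."""
    for directory in candidate_dirs:
        if os.path.isfile(os.path.join(directory, lang + ".traineddata")):
            return directory
    return None


def tessdata_config(directory):
    """Build the config string that passes --tessdata-dir to tesseract."""
    return '--tessdata-dir "{}"'.format(directory)


def ocr_with_tessdata(image_path, lang, candidate_dirs):
    """Run OCR, pointing tesseract at the directory that holds the model."""
    # Imported lazily so the helpers above work without pytesseract installed.
    import pytesseract
    from PIL import Image

    directory = find_tessdata_dir(lang, candidate_dirs)
    if directory is None:
        raise FileNotFoundError(
            lang + ".traineddata not found in " + repr(candidate_dirs)
        )
    return pytesseract.image_to_string(
        Image.open(image_path), lang=lang, config=tessdata_config(directory)
    )
```

For the setup in the question, a call could look like ocr_with_tessdata('cropped.png', 'ara', ['/usr/share/tesseract/tessdata/', '.']). Failing early with FileNotFoundError makes a missing model file easier to tell apart from a file that exists but cannot be loaded.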
For Linux/Ubuntu:
sudo apt-get install tesseract-ocr
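The validation command and the Python code are the same as for macOS.
Also make sure that the tessdata path is correct, that is, that Tesseract looks in the directory that actually contains ara.traineddata.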
This code works fine if the ara.traineddata file is downloaded successfully:
import pytesseract
from PIL import Image
print(pytesseract.image_to_string(Image.open('cropped.png'), lang="ara"))
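To confirm that Tesseract can actually see the Arabic model before running OCR, you can ask the binary for its language list with tesseract --list-langs. The sketch below wraps that call; parse_list_langs and installed_languages are hypothetical helper names, and the parser assumes the output is one header line followed by one language code per line.

```python
import subprocess


def parse_list_langs(output):
    """Extract language codes from `tesseract --list-langs` output.

    The first line is a header; each following non-empty line is one code.
    """
    lines = output.strip().splitlines()
    return [line.strip() for line in lines[1:] if line.strip()]


def installed_languages(tesseract_cmd="tesseract"):
    """Ask the tesseract binary which languages it can load."""
    result = subprocess.run(
        [tesseract_cmd, "--list-langs"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    # Fall back to stderr in case the list is not printed to stdout.
    return parse_list_langs(result.stdout or result.stderr)
```

If "ara" is missing from installed_languages(), the traineddata file is not in the directory Tesseract searches. Recent pytesseract releases also offer pytesseract.get_languages() for the same purpose, if your installed version provides it.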
You can follow this tutorial for details. Here is the demo output of this tutorial which uses Arabic language as well.
Source: https://stackoverflow.com/questions/64244290/why-pytesseract-raise-an-error-with-arabic-language