I have spent all week attempting this, so this is a bit of a Hail Mary.
I am attempting to package up Tesseract OCR into AWS Lambda running on Python (I am also usin
I struggled with this issue for a few days, trying to get Tesseract 4 to work in a Python 3.7 Lambda function. I finally found this article and GitHub repository, which describe how to build the .zip files for tesseract, pytesseract, opencv, and pillow using shell scripts that run Docker images on EC2. The process takes less than 20 minutes and is reliably reproducible.
Summarized Steps:
Start an Amazon Linux EC2 instance (a t2.micro will do just fine), then run the following commands:
sudo yum update
sudo yum install git-core -y
sudo yum install docker -y
sudo service docker start
sudo usermod -a -G docker ec2-user #allows ec2-user to call docker
After running the fifth command (usermod), you will need to log out and log back in for the change to take effect.
git clone https://github.com/amtam0/lambda-tesseract-api.git
cd lambda-tesseract-api/
bash build_tesseract4.sh #takes a few minutes
bash build_py37_pkgs.sh
This will generate .zip files for tesseract, pytesseract, pillow, and opencv. To use them with Lambda, you need to complete two more steps.
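As a rough sanity check before going further, the sketch below adds up the unzipped size of the generated archives, since Lambda caps a function plus its layers at 250 MB unzipped. It assumes the .zip files sit in the current directory; the `*.zip` pattern and the function names are mine, not something the repository provides.

```python
import glob
import zipfile

# Lambda's limit on the unzipped size of a function plus all its layers.
LAMBDA_UNZIPPED_LIMIT = 250 * 1024 * 1024


def unzipped_size(zip_path):
    """Return the total uncompressed size, in bytes, of the files in a zip."""
    with zipfile.ZipFile(zip_path) as archive:
        return sum(info.file_size for info in archive.infolist())


def report(pattern="*.zip"):
    """Print each archive's unzipped size and return the combined total."""
    total = 0
    for path in sorted(glob.glob(pattern)):
        size = unzipped_size(path)
        total += size
        print(f"{path}: {size / 1024 / 1024:.1f} MB unzipped")
    print(f"total: {total / 1024 / 1024:.1f} MB "
          f"(limit {LAMBDA_UNZIPPED_LIMIT / 1024 / 1024:.0f} MB)")
    return total


if __name__ == "__main__":
    # Assumes this is run from the directory holding the generated archives.
    report()
```

If the total is close to the limit, it may be worth leaving out a package you do not actually need (opencv, for example).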
(Note: you will probably need to increase your function's memory allocation and timeout.)
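You can change both settings in the Lambda console. If you would rather script it, here is a minimal sketch using boto3's `update_function_configuration`. The 1024 MB and 60 second values are guesses to tune for your own images, and the function name is a placeholder.

```python
def build_config_update(function_name, memory_mb=1024, timeout_s=60):
    """Build keyword arguments for Lambda's UpdateFunctionConfiguration call.

    The defaults here are illustrative guesses, not recommended values.
    """
    if timeout_s > 900:
        raise ValueError("Lambda timeouts cannot exceed 900 seconds")
    return {
        "FunctionName": function_name,
        "MemorySize": memory_mb,  # in MB
        "Timeout": timeout_s,     # in seconds
    }


def apply_config_update(function_name, memory_mb=1024, timeout_s=60):
    """Send the update to AWS; requires boto3 and configured credentials."""
    import boto3  # imported here so the helper above works without boto3

    client = boto3.client("lambda")
    return client.update_function_configuration(
        **build_config_update(function_name, memory_mb, timeout_s)
    )
```

Calling `apply_config_update("my-ocr-function")` (a hypothetical name) would apply the settings. The Lambda defaults of 128 MB and 3 seconds are usually too small for OCR on anything but tiny images.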
At this point you are all set to upload your code and start using Tesseract on AWS Lambda! Refer back to the Medium article for a test script.
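For a rough idea of what the function code can look like, here is a minimal handler sketch. It is not the script from the article. It assumes the event carries a base64-encoded image under an `image` key (a made-up event shape) and that the generated packages are importable from the function.

```python
import base64
import io


def decode_image_bytes(event):
    """Extract raw image bytes from a hypothetical {"image": <base64>} event."""
    return base64.b64decode(event["image"])


def lambda_handler(event, context):
    # Imported inside the handler so this module still loads where the
    # packages are missing (for example, on a local machine).
    import pytesseract
    from PIL import Image

    image = Image.open(io.BytesIO(decode_image_bytes(event)))
    text = pytesseract.image_to_string(image)
    return {"text": text}
```

Depending on where the tesseract binary and its language data end up inside the function, you may also need to set `pytesseract.pytesseract.tesseract_cmd` or the `TESSDATA_PREFIX` environment variable. I have not verified which of these, if any, this particular build requires.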