Tesseract OCR on AWS Lambda via virtualenv

终归单人心 2020-11-30 22:36

I have spent all week attempting this, so this is a bit of a hail mary.

I am attempting to package up Tesseract OCR into AWS Lambda running on Python (I am also usin

4 answers
  •  一向 (original poster)
    2020-11-30 23:09

    Adaptations for Tesseract 4:

    Tesseract 4 is a big improvement, thanks to its new neural-network (LSTM) recognition engine. I've tried it on some scans and the gains are substantial. The whole package was also 25% smaller in my case. Version 4 is planned for release in the first half of 2018.

    The build steps are similar to those for Tesseract 3, with some tweaks, so I am sharing them in full. I also made a GitHub repo with ready-made binaries (mostly based on Jose's post above, which was very helpful), plus a blog post on using it as a processing step after a Raspberry Pi 3 powered scanner.

    To compile the Tesseract 4 binaries, run these steps on a fresh 64-bit AWS AMI instance:

    Compile leptonica

    cd ~
    sudo yum install clang -y
    sudo yum install libpng-devel libtiff-devel zlib-devel libwebp-devel libjpeg-turbo-devel -y
    wget https://github.com/DanBloomberg/leptonica/releases/download/1.75.1/leptonica-1.75.1.tar.gz
    tar -xzvf leptonica-1.75.1.tar.gz
    cd leptonica-1.75.1
    ./configure && make && sudo make install
    

    Compile autoconf-archive

    Unfortunately, for the past few weeks the tesseract build has required autoconf-archive, which is not packaged for Amazon AMIs, so you need to compile it yourself:

    cd ~
    wget http://mirror.switch.ch/ftp/mirror/gnu/autoconf-archive/autoconf-archive-2017.09.28.tar.xz
    tar -xvf autoconf-archive-2017.09.28.tar.xz
    cd autoconf-archive-2017.09.28
    ./configure && make && sudo make install
    sudo cp m4/* /usr/share/aclocal/
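
Before moving on to tesseract itself, it can save a failed build to confirm that both prerequisites landed where the later steps expect them. A minimal sketch, assuming the paths used in this answer (`/usr/local/lib/liblept.so.5` and `/usr/share/aclocal`); the helper name `missing_prereqs` is made up for illustration:

```shell
# Print one line per missing prerequisite; print nothing if all are present.
# $1 = library dir, $2 = aclocal dir (defaults follow the paths used in this answer)
missing_prereqs() {
    lib_dir=${1:-/usr/local/lib}
    aclocal_dir=${2:-/usr/share/aclocal}
    # leptonica shared library installed by "make install"
    [ -e "$lib_dir/liblept.so.5" ] || echo "missing: $lib_dir/liblept.so.5"
    # autoconf-archive macros are all named ax_*.m4
    ls "$aclocal_dir"/ax_*.m4 >/dev/null 2>&1 || echo "missing: autoconf-archive macros in $aclocal_dir"
}

missing_prereqs
```

If this prints nothing, both the leptonica library and the autoconf-archive macros are in place.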
    

    Compile tesseract

    cd ~
    sudo yum install git-core libtool pkgconfig -y
    git clone --depth 1  https://github.com/tesseract-ocr/tesseract.git tesseract-ocr
    cd tesseract-ocr
    export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig
    ./autogen.sh
    ./configure
    make
    sudo make install
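
The library list in the packaging step below is specific to one build and may differ with other library versions. One way to see what the freshly built binary actually links against is `ldd`. A rough sketch; `ldd_paths` is a hypothetical helper, and deciding which of the listed libraries already exist in the Lambda runtime (and so need not be bundled) remains a manual step:

```shell
# Keep only resolved absolute library paths from `ldd` output read on stdin.
# Lines without "=>" (vdso, the loader) and "not found" entries are skipped.
ldd_paths() {
    awk '$2 == "=>" && $3 ~ /^\// { print $3 }'
}

# Only runs where the binary from the build above exists.
if [ -x /usr/local/bin/tesseract ]; then
    ldd /usr/local/bin/tesseract | ldd_paths
fi
```

Anything reported under `/usr/local/lib` comes from the builds above and will certainly be missing on Lambda.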
    

    Get all needed files and zip

    cd ~
    mkdir tesseract-standalone
    cd tesseract-standalone
    cp /usr/local/bin/tesseract .
    mkdir lib
    cp /usr/local/lib/libtesseract.so.4 lib/
    cp /usr/local/lib/liblept.so.5 lib/
    cp /usr/lib64/libjpeg.so.62 lib/
    cp /usr/lib64/libwebp.so.4 lib/
    cp /usr/lib64/libstdc++.so.6 lib/
    mkdir tessdata
    cd tessdata
    wget https://github.com/tesseract-ocr/tessdata_fast/raw/master/osd.traineddata
    wget https://github.com/tesseract-ocr/tessdata_fast/raw/master/eng.traineddata
    # add any other languages you need, e.g. deu.traineddata for German
    mkdir configs
    cp /usr/local/share/tessdata/configs/pdf configs/
    cp /usr/local/share/tessdata/pdf.ttf .
    cd ..
    zip -r ~/tesseract-standalone.zip *
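
To smoke-test the package before uploading, the bundled binary can be pointed at its own `lib` and `tessdata` directories. A minimal sketch, assuming the directory layout created above; `run_bundled_tesseract` is a hypothetical helper, while `--tessdata-dir` and `-l` are standard tesseract command-line options:

```shell
# Run the bundled tesseract from an unpacked package directory.
# $1 = package dir, $2 = input image, $3 = output base name (tesseract appends .txt)
run_bundled_tesseract() {
    pkg_dir=$1
    # Put the bundled libraries first on the loader path, keeping any existing value.
    LD_LIBRARY_PATH="$pkg_dir/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" \
        "$pkg_dir/tesseract" "$2" "$3" --tessdata-dir "$pkg_dir/tessdata" -l eng
}
```

For example, `run_bundled_tesseract ~/tesseract-standalone scan.png /tmp/out` would write the recognized text to `/tmp/out.txt` (`scan.png` is a placeholder). Inside Lambda the same environment variable and arguments would apply, with output written under `/tmp`, the only writable location there.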
    
