tesseract

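CreditEase (宜信) OCR Technology Exploration and Practice | Live Broadcast Notes

随声附和, submitted on 2020-08-06 21:10:42
Full video replay: CreditEase OCR Technology Exploration and Practice

Talk transcript

I. OCR overview

1.1 Evolution of OCR technology
Traditional image processing, in the tradition of Gonzalez's image-processing textbook: signal processing, frequency-domain analysis, and classic algorithms such as SIFT, HOG, Hough, Harris, Canny… all of them excellent. Since 2016 the industry has largely moved to deep learning, because the results are simply that much better.

1.2 Commercial OCR services
ID cards and similar certificates are relatively easy, although handling complex scenes is still not trivial. Invoices and business documents are more complex: beyond recognition, layout analysis matters even more. Table recognition has been popular recently and every vendor is working on it; Microsoft has released the open TableBank dataset. On mobile, the backbone is MobileNet, or the approach is tesseract + opencv.

II. Our business scenarios

2.1 Business requirements
Meeting business needs comes first. Unlike large vendors, whose public APIs must handle high concurrency and cover a complete range of document types, we focus on making each individual document type meet the business requirement as closely as possible, with an emphasis on customization. We can proceed step by step and improve continuously from business feedback.

2.2 Problems to solve during recognition

III. OCR algorithms in detail

3.1 Algorithm overview: principles of this talk
You have to work through the details yourself: read the code, even write it by hand, train, tune parameters, and debug. Only then do you gain real experience and understanding. I will cover only the points in each algorithm that I find hard to understand, important, or easy to overlook, as a basis for discussion with peers. To understand a model thoroughly, you need to ask: What is its goal, purpose, and significance? What does the network structure look like? What is the loss?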

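OCR: Extracting Data from Tables in Images

自作多情, submitted on 2020-07-29 08:50:37
Requirements analysis
Some table data exists only as images, and the complete data needs to be extracted from them.

Approach
The data in the images is laid out regularly, so the images were probably exported directly from Excel or similar software. My first thought was whether the original file was available online. Since the data is in Chinese, I tried several keywords on Baidu, Sogou, and other search engines, but found nothing. I then tried several table OCR tools that offer online trials. They handled this somewhat unusual table format poorly, and the results were disappointing. The final approach was to process the images myself: cut them into individual image blocks and run OCR on each block, which gives a good recognition rate.

OCR tool survey
tesseract (top OCR project on GitHub)
easyocr (an open-source OCR newcomer)
baidu-aip
tesseract and easyocr are both open-source OCR projects. After installation, both still need model files to be downloaded; these are large and test the stability of your network connection. baidu-aip is Baidu's SDK for its AI APIs; after registering an account on the Baidu AI platform you get some free quota. Because recognition runs online, it is fairly fast. Tencent offers a similar text-recognition API, with a much smaller free quota.

Recognition workflow
Removing the watermark
The watermark is lighter than the text, so converting the image to grayscale and filtering out the light pixels removes the watermark completely.

    import cv2

    img = cv2.imread('1.png')
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    gray =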
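The excerpt's own snippet is cut off, so the following is only a minimal sketch of the watermark-removal idea it describes: treat every pixel lighter than some cutoff as background. It works on a plain nested list of grayscale values so that it runs without OpenCV. The helper name `remove_light_pixels` and the threshold of 150 are assumptions for illustration, not values from the original post.

```python
def remove_light_pixels(gray, threshold=150):
    """Return a copy of a grayscale image (rows of 0-255 ints) in which
    every pixel lighter than `threshold` is replaced by white (255).

    `threshold` is an assumed value; it has to sit between the darkness
    of the text and the darkness of the watermark.
    """
    return [[255 if px > threshold else px for px in row] for row in gray]


# Tiny synthetic example: 30 = dark text, 200 = light watermark, 255 = background.
sample = [
    [255, 30, 200],
    [200, 30, 255],
]
cleaned = remove_light_pixels(sample)
print(cleaned)  # [[255, 30, 255], [255, 30, 255]]
```

On the NumPy array returned by `cv2.cvtColor`, the same effect can be had with boolean indexing, `gray[gray > threshold] = 255`. The right threshold depends on the actual images and has to be found by inspection.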

How do I check for output of tesseract in bash script?

蹲街弑〆低调, submitted on 2020-07-23 06:55:06
Question

I am running a loop in a bash script and passing PNG files to tesseract to read the text in the image files. If the tesseract OCR output shows "Empty page!!" or nothing at all, I want the loop to proceed to the next image. If it does include text, I want to store the output in a text file. This is what my basic script looks like:

    for i in {1..100}
    do
        tesseract file-${i}.png stdout >> result.txt
    done

Answer 1:

This is roughly what you need. I took the liberty to do an "ls" to list png files in a directory,
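For comparison with the bash loop in the question, here is a hedged sketch of the same skip-if-empty logic in Python. It is not the truncated answer's script. It assumes the `tesseract` executable is on PATH, and the function names `run_tesseract` and `collect_text` are hypothetical. Standard output and standard error are captured separately, and the filter also drops a result that consists only of the "Empty page!!" message in case it shows up in the captured text.

```python
import subprocess


def run_tesseract(image_path):
    """Run the tesseract CLI on one image and return what it wrote to stdout.

    Assumes the `tesseract` executable is on PATH. Diagnostics written to
    stderr are captured separately and do not end up in the returned text.
    """
    proc = subprocess.run(
        ["tesseract", image_path, "stdout"],
        capture_output=True,
        text=True,
    )
    return proc.stdout


def collect_text(image_paths, ocr=run_tesseract):
    """Return (path, text) pairs for images whose OCR output is non-empty."""
    results = []
    for path in image_paths:
        text = ocr(path).strip()
        if not text or text == "Empty page!!":
            continue  # nothing recognised: move on to the next image
        results.append((path, text))
    return results


# Demonstration with a stand-in OCR function, so the sketch runs without tesseract.
fake_output = {
    "file-1.png": "hello\n",
    "file-2.png": "",
    "file-3.png": "Empty page!!\n",
}
kept = collect_text(fake_output, ocr=fake_output.get)
print(kept)  # [('file-1.png', 'hello')]
```

With the real CLI, `collect_text(f"file-{i}.png" for i in range(1, 101))` would cover the same files as the bash loop, and the returned pairs can then be appended to `result.txt`.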

Tesseract OCR acts weird while scaling up image size. How to know which scale factor is best for a particular type of image?

混江龙づ霸主, submitted on 2020-06-27 18:35:11
Question

I have this 006.jpg image and I tried the following Python code. I downloaded "eng" from tessdata_best and renamed it to "eng_best".

    import cv2
    import pytesseract

    img = cv2.imread(file_path)
    lang = "eng_best"
    for img_scale_factor in range(1, 8):
        print(file_path, img_scale_factor)
        img = cv2.resize(img, None, fx=img_scale_factor, fy=img_scale_factor)
        hocr_data = pytesseract.image_to_pdf_or_hocr(img, extension="hocr", lang=lang, config="--dpi 1")
        file_name = '{0:03d}_jpg_{1}_x{3}.{2}'.format(6, lang, "hocr", img_scale_factor)
        with
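One detail worth checking in the loop as excerpted: the result of `cv2.resize` is assigned back to `img`, so each iteration scales the already scaled image and not the original. Whether that was intended is not clear from the excerpt. The arithmetic can be sketched in plain Python; the helper names and the 100x40 image size below are made up for illustration.

```python
from itertools import accumulate
from operator import mul


def compounded_factors(factors):
    """Effective scale after each step when every resize is applied
    to the result of the previous resize."""
    return list(accumulate(factors, mul))


def sizes_from_original(width, height, factors):
    """Target (width, height) when every resize starts from the original."""
    return [(width * f, height * f) for f in factors]


factors = list(range(1, 8))  # same range as the loop in the question

print(compounded_factors(factors))
# [1, 2, 6, 24, 120, 720, 5040]

print(sizes_from_original(100, 40, factors[:3]))
# [(100, 40), (200, 80), (300, 120)]
```

To compare scale factors 1 through 7 directly, one option is to keep the loaded image in a separate variable and resize from it on every pass, for example `scaled = cv2.resize(original, None, fx=f, fy=f)`, then pass `scaled` to pytesseract.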

UnicodeDecodeError 'charmap' codec with Tesseract OCR in Python

喜你入骨, submitted on 2020-06-27 18:18:12
Question

I am trying to do OCR on an image file in Python using Tesseract-OCR. My environment is Python 3.5 (Anaconda) on a Windows machine. Here is the code:

    from PIL import Image
    from pytesseract import image_to_string
    out = image_to_string(Image.open('sample.png'))

The error I am getting is:

    File "Anaconda3\lib\site-packages\pytesseract\pytesseract.py", line 167, in image_to_string
        return f.read().strip()
    File "Anaconda3\lib\encodings\cp1252.py", line 23, in decode
        return codecs.charmap_decode(input,
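The traceback ends in `cp1252.py`, which suggests the OCR output file was opened with the Windows default encoding and not as UTF-8. The sketch below reproduces that class of error with a temporary file and shows that passing an explicit encoding avoids it. `read_ocr_output` is a hypothetical helper, not part of pytesseract, and the sample text is invented.

```python
import os
import tempfile


def read_ocr_output(path, encoding="utf-8"):
    """Read a text file produced by an OCR run using an explicit encoding."""
    with open(path, encoding=encoding) as f:
        return f.read().strip()


# Curly quotes are common in OCR output. In UTF-8 the closing quote ends
# with byte 0x9d, which has no mapping in the cp1252 code page.
text = "\u201cquoted\u201d"
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(text.encode("utf-8"))
    sample_path = tmp.name

try:
    read_ocr_output(sample_path, encoding="cp1252")
    cp1252_failed = False
except UnicodeDecodeError as exc:
    cp1252_failed = True
    print(exc)  # the message names the 'charmap' codec, as in the traceback

decoded = read_ocr_output(sample_path)  # explicit UTF-8 succeeds
os.remove(sample_path)
print(decoded == text)  # True
```

If the read happens inside a library, as the traceback indicates here, the options are usually to upgrade to a release that opens the file as UTF-8 or to adjust the environment's default encoding. Which one applies to this pytesseract version is not stated in the excerpt.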

Captcha preprocessing and solving with Opencv and pytesseract

时间秒杀一切, submitted on 2020-06-24 14:17:45
Question

Problem

I am trying to write Python code for image preprocessing and recognition using Tesseract-OCR. My goal is to solve this form of captcha reliably.

[Image: original captcha and result of each preprocessing step]

Steps as of now

Greyscale and thresholding of image
Image enhancing with PIL
Convert to TIF and scale to >300px
Feed it to Tesseract-OCR (whitelisting all uppercase alphabets)

However, I still get a rather incorrect reading (EPQ M Q). What other preprocessing steps can I take to
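The last step in the list, whitelisting uppercase letters, can be sketched as a config string plus a small clean-up of the raw reading. This is an illustration under assumptions, not the asker's code: the helper names are hypothetical, and `--psm 7` (treat the image as a single text line) is an assumed choice for a one-line captcha. Whether `tessedit_char_whitelist` is honoured can depend on the Tesseract version and engine mode.

```python
import string


def uppercase_whitelist_config(psm=7):
    """Build a tesseract config string that restricts output to A-Z.

    `--psm 7` is an assumption: it tells tesseract to treat the image
    as a single line of text.
    """
    return f"--psm {psm} -c tessedit_char_whitelist={string.ascii_uppercase}"


def clean_reading(raw):
    """Drop whitespace and any character outside A-Z from an OCR result."""
    return "".join(ch for ch in raw if ch in string.ascii_uppercase)


config = uppercase_whitelist_config()
print(config)
# --psm 7 -c tessedit_char_whitelist=ABCDEFGHIJKLMNOPQRSTUVWXYZ

print(clean_reading("EPQ M Q"))  # EPQMQ
```

The string can be passed as `pytesseract.image_to_string(img, config=config)`. The clean-up only removes stray spaces and symbols; it does not correct misread letters, which still depends on the preprocessing steps.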
