tesseract

Get font of recognized character with Tesseract-OCR

时光怂恿深爱的人放手 posted on 2020-01-23 08:26:07
Question: Is it possible to get the font of the recognized characters with Tesseract-OCR, i.e. whether they are Arial or Times New Roman, either from the command line or using the API? I'm scanning documents that might have different parts in different fonts, and it would be useful to have this information.
Answer 1: Tesseract has an API function, WordFontAttributes, defined in the ResultIterator class, that you can use.
Answer 2: Based on nguyenq's answer, I wrote a simple Python script that prints the font name for each
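The excerpt above is cut off before the script, so the following is only a minimal sketch of the idea, not the answerer's code. It assumes the third-party `tesserocr` binding is installed; `describe_font` and `print_word_fonts` are hypothetical helper names. Font attributes may be missing (returned as `None`) depending on the engine mode and trained data in use.

```python
def describe_font(word, attrs):
    """Format one word and its font attributes (a dict, or None) as a line of text."""
    if not attrs:
        return f"{word}: font unknown"
    # Collect the style flags that are set; the key names follow tesserocr's dict.
    styles = [name for name in ("bold", "italic") if attrs.get(name)]
    suffix = f" ({', '.join(styles)})" if styles else ""
    return f"{word}: {attrs.get('font_name', 'unknown')}{suffix}"


def print_word_fonts(image_path):
    """Run OCR on image_path and print the font reported for each word."""
    # Imported lazily so describe_font can be used without tesserocr installed.
    from tesserocr import PyTessBaseAPI, RIL, iterate_level

    with PyTessBaseAPI() as api:
        api.SetImageFile(image_path)
        api.Recognize()
        for item in iterate_level(api.GetIterator(), RIL.WORD):
            word = item.GetUTF8Text(RIL.WORD)
            print(describe_font(word, item.WordFontAttributes()))
```

Calling `print_word_fonts("scan.png")` (a hypothetical file name) would print one line per recognized word.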

Tesseract-ocr gem issue on mac os x

淺唱寂寞╮ posted on 2020-01-23 07:32:25
Question: I've installed the tesseract-ocr gem (0.1.5) on Mac OS X Yosemite. Its dependencies are also installed (tesseract/3.04.00 and leptonica/1.72). When I run rake db:migrate, an error appears:
    rake aborted!
    CompilationError: compile error: see logs at /var/folders/xg/g9n7qvns5z1gsr_yjh09n1nm0000gn/T/.ffi-inline-501/d2f8bb8a1867b800ff8ad69a3b850c91521b3760.log
    /Users/user/.rvm/gems/ruby-2.2.2@project/gems/ffi-inline-0.0.4.3/lib/ffi/inline/compilers/gcc.rb:35:in `compile'
    /Users/user/.rvm/gems/ruby-2.2.2

Text binarization

北慕城南 posted on 2020-01-23 01:59:49
Question: I'd like to binarize this image: to use it with tesseract-ocr. Currently, I managed to get this: But I need a clear image with only the text, without the black background parts, like this one: My current code:
    img = cv2.imread(path, 0)
    blur = cv2.GaussianBlur(img, (3, 3), 0)
    filtered = cv2.adaptiveThreshold(img, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 405, 1)
    bitnot = cv2.bitwise_not(filtered)
    cv2.imshow('image', bitnot)
    cv2.imwrite("h2kcw2/out1.png", bitnot)
    cv2.waitKey(0)
    cv2

tesseract

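∥☆過路亽.° posted on 2020-01-23 01:06:15
Contents: Image CAPTCHA recognition; tesseract; Installation (Windows, Linux); Recognizing images with tesseract from the Windows command line; Using tesseract in Python
Image CAPTCHA recognition: An image CAPTCHA sometimes appears when logging in or requesting certain data, so it is worth learning how to turn images into text. Turning images into text is generally called optical character recognition, or OCR for short. There are not many libraries that implement OCR, especially open-source ones, because the field has real technical barriers (it needs large amounts of data, algorithms, and machine learning and deep learning knowledge) and because a good implementation has high commercial value. Here we use tesseract.
tesseract: tesseract is an OCR library currently sponsored by Google. It is widely regarded as the best and most accurate open-source OCR library. It has a high recognition rate and is also very flexible: it can be trained to recognize any font.
Installation, Windows: Download the executable from the link below, then keep clicking Next to install it (use a path that needs no special permissions and contains only English characters): https://github.com/UB-Mannheim/tesseract/wiki The installer offers a choice of recognition languages; tick the ones you need. After installing, add it to your environment variables.
Installation, Linux: You can download the source from the link below and compile it yourself: https://github.com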

How to reduce the size of the PDF generated by tesseract?

北战南征 posted on 2020-01-22 20:48:06
Question: The setup of my (web) app is the following: I get user-uploaded PDF files, I run OCR on them, and I show the users the OCRed PDF. Since everything is online, minimizing the size of the resulting PDF file is key to reducing loading and wait time for the user. The file I receive from the user is sample.pdf (I've created an archive with the original files as well as those that I generate here: https://dl.dropboxusercontent.com/u/1390155/tess-files/sample.zip). I use tesseract 3.04 and do the following

Use Tesseract 4 - Docker Container from uwsgi-nginx-flask-docker

感情迁移 posted on 2020-01-22 16:03:25
Question: I had my Python project running locally, and it worked. I call tesseract from Python with the subprocess package. Then I deployed my project, and since I use Flask, I installed tiangolo-uwsgi-flask-nginx-docker, but Tesseract isn't installed there. That's why my project no longer works: it cannot find tesseract. It doesn't pick up the tesseract installed on my AWS instance either, because tesseract isn't installed inside the Docker container. That's why I would like to use also
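The failure described above comes down to the `tesseract` executable not being on the PATH inside the container. A minimal sketch of how a subprocess-based caller could detect that early and fail with a clear message; `find_tesseract` and `run_tesseract` are hypothetical helper names, and the asker's actual code is not shown in the excerpt.

```python
import shutil
import subprocess


def find_tesseract(binary="tesseract"):
    """Return the full path of the tesseract executable, or None if it is not on PATH."""
    return shutil.which(binary)


def run_tesseract(image_path, binary="tesseract"):
    """OCR image_path with the tesseract CLI and return the recognized text."""
    exe = find_tesseract(binary)
    if exe is None:
        # Inside a container this means the image itself lacks tesseract,
        # even if the host machine has it installed.
        raise FileNotFoundError(f"{binary!r} not found on PATH; install it in this environment")
    # "stdout" as the output base makes tesseract write the text to standard output.
    result = subprocess.run(
        [exe, image_path, "stdout"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

Running `find_tesseract()` at application start-up is one way to surface a missing installation before the first OCR request arrives.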

C#: Using Tesseract-OCR v5.0 to recognize CAPTCHAs, Chinese text, and ID cards

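人走茶凉 posted on 2020-01-22 14:48:28
OCR (Optical Character Recognition) refers to analyzing and recognizing the text in an image file in order to extract it. Tesseract is an open-source OCR engine. It was originally developed at HP Labs, later contributed to the open-source community, and then improved, debugged, optimized, and re-released by Google. Project page: https://github.com/tesseract-ocr This article uses the latest version, Tesseract-OCR v5.0, to recognize CAPTCHAs, Chinese text, and ID cards; the results are shown in the figure below.
Demo program structure: Create a WinForms application in VS2019 and add the corresponding controls.
Program execution: Use the Process class to call tesseract.exe to perform the image recognition. Note that tesseract-ocr must be installed first for this to succeed. For installation and environment-variable setup, see steps 1 to 3 of the previous article, "Tesseract-OCR v5.0 Chinese recognition: training a custom font library to improve image recognition".
Summary: This article demonstrated using Tesseract-OCR v5.0 from C# to recognize CAPTCHAs, Chinese text, and ID cards. To improve the recognition rate, refer to the previous article. Tesseract-OCR is largely sufficient for simple recognition tasks.
Source: https://www.cnblogs.com/channel9/p/12228457.html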

Tesseract: How to run tesseract with multiple languages one time

只愿长相守 posted on 2020-01-22 13:48:25
Question: I have to analyze an image containing both English and Japanese text. When I run tesseract with the default language (eng), some Japanese characters are lost. Conversely, if I run tesseract with Japanese (-l jpn), some English characters are lost (e.g. Email). How can I run a single process that recognizes both English and Japanese characters? Thanks.
Answer 1: Since tesseract 3.02 it is possible to specify multiple languages for the -l parameter. -l lang The language to use. If none is specified, English is assumed.
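On the command line, multiple languages are joined with `+`, as in `-l eng+jpn`. A minimal sketch that builds such a command for `subprocess`; the helper name `tesseract_command` and the file names are hypothetical, and the `jpn` trained data is assumed to be installed.

```python
def tesseract_command(image_path, output_base, languages):
    """Build a tesseract CLI invocation that recognizes several languages in one run."""
    if not languages:
        raise ValueError("at least one language code is required")
    # Tesseract expects the language codes joined with '+', e.g. "eng+jpn".
    return ["tesseract", image_path, output_base, "-l", "+".join(languages)]


if __name__ == "__main__":
    # Prints the argument list; pass it to subprocess.run(...) to execute it.
    print(tesseract_command("mixed.png", "out", ["eng", "jpn"]))
```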

Preserving indentation with Tesseract OCR 4.x

本秂侑毒 posted on 2020-01-22 13:16:04
Question: I'm struggling with Tesseract OCR. I have an image of a blood examination report that contains an indented table. Although tesseract recognizes the characters very well, the table's structure isn't preserved in the final output. For example, look at the lines below "Emocromo con formula" (English translation: blood count with formula), which are indented. I want to preserve that indentation. I read the other related discussions and found the option preserve_interword_spaces=1. The result became slightly better but as
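For reference, here is a minimal sketch of how the `preserve_interword_spaces=1` option mentioned above could be passed from Python. It assumes the third-party `pytesseract` and Pillow packages; `build_config` and `ocr_with_layout` are hypothetical names, and the page segmentation mode 6 is my own assumption rather than something the question states.

```python
def build_config(variables, psm=None):
    """Build a tesseract config string from config variables and an optional --psm."""
    parts = []
    if psm is not None:
        parts.append(f"--psm {psm}")
    # Each config variable is passed as its own "-c key=value" pair.
    for key, value in variables.items():
        parts.extend(["-c", f"{key}={value}"])
    return " ".join(parts)


def ocr_with_layout(image_path):
    """OCR image_path while asking tesseract to keep the spacing between words."""
    # Imported lazily so build_config works without the OCR packages installed.
    import pytesseract
    from PIL import Image

    config = build_config({"preserve_interword_spaces": 1}, psm=6)
    return pytesseract.image_to_string(Image.open(image_path), config=config)
```

As the question notes, this option alone only partly preserves the indentation.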

How to extract text or numbers from images using python

♀尐吖头ヾ posted on 2020-01-20 08:34:20
Question: I want to extract text (mainly numbers) from images like this. I tried this code:
    import pytesseract
    from PIL import Image
    pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
    img = Image.open('1.jpg')
    text = pytesseract.image_to_string(img, lang='eng')
    print(text)
But all I get is this: (hE PPAR)
Answer 1: When performing OCR, it is important to preprocess the image so that the text to detect is in black with the background in white. To do this, here's a simple
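The answer above is cut off, so the following is not its method. It is a minimal pure-Python sketch of one way to get black text on a white background: a global Otsu threshold followed by an inversion when the dark pixels are the majority. It assumes a grayscale image given as rows of 0-255 integers and that the text covers fewer pixels than the background; both function names are hypothetical.

```python
def otsu_threshold(pixels):
    """Return the gray level t that maximizes between-class variance (dark: p <= t)."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_t, best_var = 0, -1.0
    w_dark, sum_dark = 0, 0
    for t in range(256):
        w_dark += hist[t]
        if w_dark == 0:
            continue  # no pixels at or below t yet
        w_light = total - w_dark
        if w_light == 0:
            break  # every pixel is already in the dark class
        sum_dark += t * hist[t]
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        variance = w_dark * w_light * (mean_dark - mean_light) ** 2
        if variance > best_var:
            best_var, best_t = variance, t
    return best_t


def binarize_black_on_white(rows):
    """Threshold a grayscale image and make sure the minority class (text) is black."""
    flat = [p for row in rows for p in row]
    t = otsu_threshold(flat)
    binary = [[0 if p <= t else 255 for p in row] for row in rows]
    dark = sum(1 for row in binary for p in row if p == 0)
    # Assumption: text covers less area than background, so a mostly dark
    # result means light text on a dark background and must be inverted.
    if dark * 2 > len(flat):
        binary = [[255 - p for p in row] for row in binary]
    return binary
```

In practice the same steps are usually done with an image library; the plain-list version here only keeps the sketch free of dependencies.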