ocr

Unable to extract scanned pdf using TesseractOCRConfig Apache Tika

倾然丶 夕夏残阳落幕 submitted on 2020-01-29 17:58:17
Question: My PDF contains scanned images and I want to extract text from it. What I tried: I tried AutoDetectParser but got no output. I followed the solution given in "Apache Tika extract scanned PDF files" and the Apache Tika Jira issue at https://issues.apache.org/jira/browse/TIKA-1729, but I get an empty string without any error. My configuration: Windows 7 64-bit, JDK 1.8.0_45. Any kind of help is welcome. Answer 1: Steps to follow to solve this: Install Tesseract on your system using 'tesseract-ocr

Text Recognition Optimization Based on Tesseract-OCR

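假如想象 submitted on 2020-01-26 23:48:05
I. Requirements analysis
Company information collected from the Tmall platform is to be structured: the text is extracted and then consolidated into an Excel file as the deliverable. The main functional requirements are:
1. The program automatically reads the path of the folder holding the business-registration images, takes the images from that folder in order and recognizes them, and delivers the final results as a single consolidated Excel file.
2. Because the images published on Tmall have no fixed format, the program must be able to extract information from images with different layouts.
3. It must extract the company registration number and the company name from each image, and analyze and process those two fields.
4. Recognition accuracy must be at least 95%.
5. Recognition speed must hold at 50 images per 60 seconds.
II. Key image-processing modules of this program
1. Cropping the image: the text to be recognized ("company name" and "company registration number") occupies only one part of the whole image. Cutting away the rest and keeping only the key region improves both recognition speed and recognition rate.
2. Binarizing the image, for which there are two approaches:
(1) For color images, find a suitable gray level for each pixel, because each pixel's gray level is affected to some degree by the weighted influence of its neighbors, which in turn affects the recognition rate of the whole image. For example, add the pixel's gray value to those of its 8 neighbors and divide by 9 to get its relative gray value.
(2) For black-and-white images, binarize with the max-min method.
Because the images this program handles have strong black/white contrast, the max-min method is used. private static File binaryImage(File orcFile)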
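The two binarization ideas described above can be sketched in a few lines of Python. This is a minimal illustration, not the article's Java `binaryImage` method (which is cut off in the excerpt). The function names are hypothetical, the image is a plain list of rows of gray values, and two details are assumptions: border pixels are averaged over the neighbors that actually exist rather than always dividing by 9, and "max-min" is read as thresholding at the midpoint between the darkest and brightest pixel.

```python
def mean_3x3(gray):
    """Replace each pixel by the mean of itself and its (up to 8) neighbors."""
    h, w = len(gray), len(gray[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Collect the 3x3 window, clipped at the image border (assumption).
            window = [gray[j][i]
                      for j in range(max(0, y - 1), min(h, y + 2))
                      for i in range(max(0, x - 1), min(w, x + 2))]
            out[y][x] = sum(window) // len(window)
    return out


def binarize_max_min(gray):
    """Threshold at the midpoint between the darkest and brightest pixel."""
    lo = min(min(row) for row in gray)
    hi = max(max(row) for row in gray)
    threshold = (lo + hi) / 2
    # Pixels brighter than the threshold become white, the rest black.
    return [[255 if v > threshold else 0 for v in row] for row in gray]


sample = [[10, 200], [30, 220]]
binary = binarize_max_min(sample)
```

The midpoint threshold only works well when the image really has two well-separated gray levels, which matches the article's stated reason for choosing this method.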

After cropping an image, how to find new bounding box coordinates?

血红的双手。 submitted on 2020-01-25 06:50:09
Question: Here's a receipt image that I've got, and I've plotted it using matplotlib:
    # x1, y1, x2, y2, x3, y3, x4, y4
    bbox_coords = [[650, 850], [1040, 850], [1040, 930], [650, 930]]
    image = cv2.imread(IMG_FILE)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    fig, ax = plt.subplots(figsize=(20, 20))
    ax.imshow(gray, cmap='Greys_r');
    rect = Polygon(bbox_coords, fill=False, linewidth=1, edgecolor='r')
    ax.add_patch(rect)
    plt.show()
    print(gray.shape)
    (4376, 2885)
Then, I've cropped the original gray image
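The question is cut off before the crop itself, so the following is only a sketch of the usual approach under an assumed crop. If the image is cropped with array slicing, each bounding-box corner moves by the crop's top-left offset. The crop window `gray[800:1000, 600:1100]` and the helper name `shift_bbox` are hypothetical and not taken from the question.

```python
def shift_bbox(bbox_coords, crop_x, crop_y, crop_w=None, crop_h=None):
    """Translate [x, y] corners into the coordinate system of a crop.

    crop_x, crop_y: top-left corner of the crop in the original image.
    crop_w, crop_h: optional crop size; if given, corners are clipped to it.
    """
    shifted = [[x - crop_x, y - crop_y] for x, y in bbox_coords]
    if crop_w is not None and crop_h is not None:
        shifted = [[min(max(x, 0), crop_w - 1), min(max(y, 0), crop_h - 1)]
                   for x, y in shifted]
    return shifted


bbox_coords = [[650, 850], [1040, 850], [1040, 930], [650, 930]]

# Hypothetical crop: rows 800..999 and columns 600..1099 of the gray image,
# i.e. gray[800:1000, 600:1100]. Note that NumPy indexes rows (y) first.
crop_x, crop_y, crop_w, crop_h = 600, 800, 500, 200

new_bbox = shift_bbox(bbox_coords, crop_x, crop_y, crop_w, crop_h)
print(new_bbox)
```

The clipping step only matters when the crop cuts through the box. For a box that lies fully inside the crop, the result is a pure translation.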

other options for AWS Textract .Net SDK

空扰寡人 submitted on 2020-01-25 06:43:13
Question: I am working on a C# MVC solution which needs to support uploading thousands of scanned PDF survey forms to a system and then extracting the data from each survey. To extract hand-written checkboxes I need to use the AWS Textract API. More information on my project can be found here: AWS textract with hand-written checkboxes. My problem is that when I downloaded the AWS SDK for .NET, I noticed that .Textract is not fully available at the moment for .NET. My question is, is there any

OCR : Not getting desired result

雨燕双飞 submitted on 2020-01-24 01:19:29
Question: I have this image. I am trying to OCR the letters in this image, but I am not getting the desired result for the letters '9' and 'R'. First I cropped these letters and then executed the following command: tesseract 9.png stdout -psm 8. It just returns ".". OCR works fine for all the other letters, but not for these two (though I think their image quality is not that bad). Any suggestion/help is appreciated. Answer 1: I've no experience with tesseract myself, but replicating the character and adding

Image recognition with PHP

≯℡__Kan透↙ submitted on 2020-01-23 08:08:40
Question: I was wondering if there's any way of writing a PHP script that can read an image and look for specific elements in it. For example, the image will contain a list of names, and for each name there will be a box where a specific character will be present. I want to be able to get all the names and to check for which names that specific character is present. Thank you. Answer 1: You should try to use an OCR class already made, like this one: http://www.phpclasses.org/package/2874-PHP-Recognize-text

How to reduce the size of the PDF generated by tesseract?

北战南征 submitted on 2020-01-22 20:48:06
Question: The setup of my (web) app is the following: I get user-uploaded PDF files, I run OCR on them, and I show the users the OCRed PDF. Since everything is online, minimizing the size of the resulting PDF file is key to reducing loading and wait time for the user. The file I receive from the user is sample.pdf (I've created an archive with the original files as well as those that I generate here: https://dl.dropboxusercontent.com/u/1390155/tess-files/sample.zip). I use tesseract 3.04 and do the following

2019 Year-End Summary

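强颜欢笑 submitted on 2020-01-22 14:14:41
Today is the last working day, and another year is ending. Goals from last year's summary: 1. English. 2. Azure Cloud. 3. BLE. 4. Image recognition (passport OCR). I did not complete 1 and 2 and basically abandoned them entirely, because a project arrived midway and time was extremely tight. 3 and 4 are basically done.
I spent many days, on and off, learning how BLE works, but trying to recall it today, I seem to have forgotten it all :( For OpenCV, I worked through its tutorials and feel I now understand it fairly well. For passport OCR, I found an example that met the requirements almost completely, so I did not build one myself :D
Along the way I made a small example: an answer sheet has many options, and to see which one the candidate selected (filled in), you first need to know each option's region and then decide whether that region has been filled in. With some OpenCV processing, I can now mark all the options fairly accurately; as the figure below shows, all the A/B/C/D options are outlined with red boxes.
The biggest change this year was changing jobs; I also completed a project. I had not really expected to change jobs; I used to think I would stay at the company until retirement (or until being eased out). Still, when the opportunity came, I was excited. Perhaps I had also become thoroughly frustrated with the work at my old company! The other thing was completing a project, which meant a great deal!
Goals for next year: English, Flutter. ----END---- Source: CSDN Author: Ani Link: https://blog.csdn.net/Ani/article/details/104068863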
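The answer-sheet check mentioned in the summary above (find each option's region, then decide whether it has been filled in) can be sketched roughly as follows. This is not the author's OpenCV code, which the post does not show. The function name, the box format `(x1, y1, x2, y2)`, and both thresholds are assumptions chosen for illustration, and the image is a plain list of rows of gray values.

```python
def is_filled(gray, box, dark_below=128, min_dark_ratio=0.5):
    """Return True if enough pixels inside the box are dark.

    box is (x1, y1, x2, y2) with x2 and y2 exclusive.
    dark_below and min_dark_ratio are illustrative thresholds (assumptions).
    """
    x1, y1, x2, y2 = box
    pixels = [gray[y][x] for y in range(y1, y2) for x in range(x1, x2)]
    dark = sum(1 for v in pixels if v < dark_below)
    return dark / len(pixels) >= min_dark_ratio


# Tiny synthetic "answer sheet": the left half is filled in, the right is blank.
sheet = [[0, 0, 255, 255] for _ in range(4)]
options = {"A": (0, 0, 2, 4), "B": (2, 0, 4, 4)}

marked = [name for name, box in options.items() if is_filled(sheet, box)]
print(marked)
```

On real scans the thresholds would need tuning, since printed option letters and box outlines also contribute dark pixels to each region.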

Tesseract: How to run tesseract with multiple languages one time

只愿长相守 submitted on 2020-01-22 13:48:25
Question: I have to analyze an image which contains both English and Japanese text. When I run tesseract with the default language (eng), some Japanese characters are lost. Conversely, if I run tesseract with Japanese (-l jpn), some English characters are lost (e.g. Email). How can I run one process which recognizes both English and Japanese characters? Thanks. Answer 1: Since tesseract 3.02 it is possible to specify multiple languages for the -l parameter. -l lang The language to use. If none is specified, English is assumed.
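For the multi-language case, the Tesseract manual describes joining language codes with a plus sign, for example `-l eng+jpn`. The small Python sketch below only builds that command line. The helper name `tesseract_command` and the file names are hypothetical, and actually running the command assumes that Tesseract and both language data files are installed.

```python
def tesseract_command(image_path, output_base, languages):
    """Build a tesseract command line that uses several languages at once."""
    # Tesseract expects the language codes joined with '+', e.g. "eng+jpn".
    return ["tesseract", image_path, output_base, "-l", "+".join(languages)]


cmd = tesseract_command("receipt.png", "out", ["eng", "jpn"])
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`. The order of the languages may affect both results and speed, so it is worth trying both orders on a sample image.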