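My manager asked me to build an invoice-recognition service. Until now I have only written CRUD code and had never touched anything as fancy as image recognition, so I am writing down what I ran into.
Create a new project and add the tess4j dependency: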
<dependency>
    <groupId>net.sourceforge.tess4j</groupId>
    <artifactId>tess4j</artifactId>
    <version>4.4.0</version>
</dependency>
Write a test class:
package com.example.cor1.test;

import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;

import java.io.File;

public class Test1 {
    public static void main(String[] args) throws TesseractException {
        // Image to run OCR on
        File test1 = new File("C:\\Users\\xxx\\Desktop\\tesseract\\test1.png");
        Tesseract tesseract = new Tesseract();
        // Use the Simplified Chinese trained data
        tesseract.setLanguage("chi_sim");
        String s = tesseract.doOCR(test1);
        System.out.println(s);
    }
}
It failed as soon as I ran it:
Exception in thread "main" java.lang.NoSuchMethodError: com.sun.jna.Native.load(Ljava/lang/String;Ljava/lang/Class;)Lcom/sun/jna/Library;
at net.sourceforge.tess4j.util.LoadLibs.getTessAPIInstance(LoadLibs.java:85)
at net.sourceforge.tess4j.TessAPI.<clinit>(TessAPI.java:42)
at net.sourceforge.tess4j.Tesseract.init(Tesseract.java:427)
at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:223)
at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:195)
at com.example.cor1.test.Test1.main(Test1.java:15)
The error says that the Native.load method called here does not exist:
public static TessAPI getTessAPIInstance() {
return (TessAPI)Native.load(getTesseractLibName(), TessAPI.class);
}
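Navigating to Native in IDEA showed that it came from a 4.x jna jar, while the pom.xml of tess4j 4.4.0 depends on jna 5.3.1. Native.load was only added in JNA 5.0, so an older jar on the classpath explains the NoSuchMethodError. IDEA's Maven dependency view did not show any conflict either, and I don't know why, so I changed pom.xml to declare jna 5.3.1 explicitly: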
<dependency>
    <groupId>net.java.dev.jna</groupId>
    <artifactId>jna</artifactId>
    <version>5.3.1</version>
</dependency>
<dependency>
    <groupId>net.sourceforge.tess4j</groupId>
    <artifactId>tess4j</artifactId>
    <version>4.4.0</version>
    <exclusions>
        <exclusion>
            <artifactId>commons-io</artifactId>
            <groupId>commons-io</groupId>
        </exclusion>
        <exclusion>
            <artifactId>commons-logging</artifactId>
            <groupId>commons-logging</groupId>
        </exclusion>
        <exclusion>
            <artifactId>jna</artifactId>
            <groupId>net.java.dev.jna</groupId>
        </exclusion>
    </exclusions>
</dependency>
The two commons exclusions are there because I noticed version conflicts for them and removed them while I was at it.
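When the IDE's dependency view does not explain a NoSuchMethodError like the one above, it can help to ask the running JVM which jar a class was actually loaded from. The following is only a rough sketch: ClassOrigin and its two methods are hypothetical helper names, not part of Tess4J or JNA, and it uses nothing beyond standard reflection.

```java
import java.lang.reflect.Method;
import java.security.CodeSource;

// Hypothetical diagnostic helper; the class and method names are made up for this sketch.
class ClassOrigin {

    // Returns the jar or directory the class was loaded from,
    // or "<bootstrap>" when the JVM reports no code source (typical for core JDK classes).
    static String locationOf(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        if (source == null || source.getLocation() == null) {
            return "<bootstrap>";
        }
        return source.getLocation().toString();
    }

    // True if the class exposes at least one public method with the given name.
    static boolean hasMethod(Class<?> type, String name) {
        for (Method m : type.getMethods()) {
            if (m.getName().equals(name)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // Defaults to a JDK class so the sketch runs without extra jars;
        // pass e.g. "com.sun.jna.Native load" when JNA is on the classpath.
        String className = args.length > 0 ? args[0] : "java.lang.String";
        String methodName = args.length > 1 ? args[1] : "isEmpty";
        // initialize=false: inspect the class without running its static initializers
        Class<?> type = Class.forName(className, false, ClassOrigin.class.getClassLoader());
        System.out.println(className + " loaded from: " + locationOf(type));
        System.out.println("has public method '" + methodName + "': " + hasMethod(type, methodName));
    }
}
```

Run with the project's classpath and the arguments com.sun.jna.Native load, this should print the path of the JNA jar that won and whether it has a load method. On the build side, mvn dependency:tree -Dincludes=net.java.dev.jna lists where the jna version is coming from.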
With the pom.xml change in place, the first error no longer appeared on startup, but a new one did:
Error opening data file ./chi_sim.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language 'chi_sim'
Tesseract couldn't load any languages!
Warning: Invalid resolution 0 dpi. Using 70 instead.
Exception in thread "main" java.lang.Error: Invalid memory access
at com.sun.jna.Native.invokePointer(Native Method)
at com.sun.jna.Function.invokePointer(Function.java:497)
at com.sun.jna.Function.invoke(Function.java:441)
at com.sun.jna.Function.invoke(Function.java:361)
at com.sun.jna.Library$Handler.invoke(Library.java:265)
at com.sun.proxy.$Proxy0.TessBaseAPIGetUTF8Text(Unknown Source)
at net.sourceforge.tess4j.Tesseract.getOCRText(Tesseract.java:517)
at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:359)
at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:228)
at net.sourceforge.tess4j.Tesseract.doOCR(Tesseract.java:195)
at com.example.cor1.test.Test1.main(Test1.java:15)
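The message says the Chinese trained data cannot be found. I checked the tessdata folder inside the jar and it indeed contains no Chinese data, but the real cause of this error is that no path to the trained data was specified.
Download the trained data from https://github.com/tesseract-ocr/tessdata and put it in the project root directory (the code that follows looks for it in a folder named tessdata).
The path also has to be set in the code: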
public static void main(String[] args) throws TesseractException {
    File test1 = new File("C:\\Users\\xxx\\Desktop\\tesseract\\test1.png");
    Tesseract tesseract = new Tesseract();
    tesseract.setDatapath("tessdata");
    tesseract.setLanguage("chi_sim");
    String s = tesseract.doOCR(test1);
    System.out.println(s);
}
Run it again with the data path set, and the text gets recognized!
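A missing traineddata file showed up earlier as a rather confusing "Invalid memory access" error, so a small check before calling doOCR can give a clearer message. This is only a sketch: TessdataCheck and missingLanguages are hypothetical names, not part of Tess4J, and it assumes the usual Tesseract layout of one <lang>.traineddata file per language in the data directory, with multiple languages joined by "+".

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical pre-flight check; not part of Tess4J.
class TessdataCheck {

    // Returns the languages whose "<lang>.traineddata" file is missing from dataDir.
    // Languages may be combined with '+', for example "chi_sim+eng".
    static List<String> missingLanguages(Path dataDir, String languages) {
        List<String> missing = new ArrayList<>();
        for (String lang : languages.split("\\+")) {
            if (lang.isEmpty()) {
                continue; // ignore stray '+' separators
            }
            if (!Files.isRegularFile(dataDir.resolve(lang + ".traineddata"))) {
                missing.add(lang);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Same relative path and language as in the test class above
        Path dataDir = Paths.get(args.length > 0 ? args[0] : "tessdata");
        String languages = args.length > 1 ? args[1] : "chi_sim";
        List<String> missing = missingLanguages(dataDir, languages);
        if (missing.isEmpty()) {
            System.out.println("All trained data found in " + dataDir.toAbsolutePath());
        } else {
            System.out.println("Missing trained data for " + missing + " in " + dataDir.toAbsolutePath());
        }
    }
}
```

Printing the absolute path is deliberate: a relative path such as "tessdata" is resolved against the working directory, which may not be the directory you expect when running from an IDE.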
Original image:
Recognized text:
As you can see, the recognition is not quite right, so there is still work to do!
Source: https://blog.csdn.net/qq_36956002/article/details/100083564