Extracting text from PDFs in C# [closed]


误落风尘 2020-11-29 05:40

6 answers

情深已故 (original poster)

2020-11-29 06:35
Take a look at Tika on DotNet, available through NuGet: https://www.nuget.org/packages/TikaOnDotnet.TextExtractor/

This is a wrapper around the excellent Tika Java library, using IKVM. It is very easy to use and handles a wide variety of file types besides PDF, including old and new Office formats. It auto-selects the parser based on the file extension, so it's as easy as:
```
var text = new TextExtractor().Extract(file.FullName).Text;
```
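For anyone who wants a fuller starting point, here is a rough sketch of the same call inside a small console program. The namespace `TikaOnDotNet.TextExtraction` and the `ContentType` property are as I remember them from the package's README, so check them against the version you install. The `sample.pdf` default path is only a placeholder, and the error handling is my own addition, not something the library requires.

```csharp
using System;
using System.IO;
using TikaOnDotNet.TextExtraction; // assumed namespace of the TikaOnDotnet.TextExtractor package

class Program
{
    static int Main(string[] args)
    {
        // Placeholder default; pass a real file path as the first argument.
        var path = args.Length > 0 ? args[0] : "sample.pdf";
        if (!File.Exists(path))
        {
            Console.Error.WriteLine($"File not found: {path}");
            return 1;
        }

        try
        {
            // Same call as in the one-liner above, keeping the whole result object.
            var result = new TextExtractor().Extract(path);
            Console.WriteLine($"Content type: {result.ContentType}");
            Console.WriteLine(result.Text);
            return 0;
        }
        catch (Exception ex)
        {
            // Catch broadly here: corrupt or encrypted files surface as exceptions.
            Console.Error.WriteLine($"Extraction failed: {ex.Message}");
            return 2;
        }
    }
}
```

Keeping the result object rather than reading `.Text` straight away lets you log the detected content type, which can help when a file turns out not to be what its name suggests.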
Update: One caution with this solution is that development on IKVM has ended. I'm not sure what this will mean in the long run. http://weblog.ikvm.net/2017/04/21/TheEndOfIKVMNET.aspx