How to access and manipulate PDF file data in Hadoop?


耶瑟儿～ 2021-01-07 13:28

I want to read PDF files using Hadoop. How is this possible? As far as I know, Hadoop can only process text files, so is there any way to parse PDF files into text?

2 answers

耶瑟儿～ (original poster)

2021-01-07 13:50

An easy way would be to create a SequenceFile to hold the PDF files. SequenceFile is a binary file format, so you could make each record in the SequenceFile one PDF. To do this, create a class that implements Writable and contains the PDF along with any metadata you need. You could then use any Java PDF library, such as PDFBox, to manipulate the PDFs.
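
As a rough illustration of the record class described above, here is a minimal sketch. It uses only java.io so that it runs without Hadoop on the classpath. The two methods mirror the signatures of Hadoop's Writable interface (write(DataOutput) and readFields(DataInput)), so in a real job the class would additionally declare that it implements org.apache.hadoop.io.Writable. The class name PdfRecord, its fields, and the roundTrip helper are hypothetical names chosen for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical record type: one PDF plus a little metadata.
// Assumption: in a real job this would declare
// "implements org.apache.hadoop.io.Writable".
class PdfRecord {
    String fileName = "";          // illustrative metadata field
    byte[] content = new byte[0];  // raw bytes of the PDF

    // Hadoop creates Writable instances through a no-argument constructor.
    PdfRecord() {}

    PdfRecord(String fileName, byte[] content) {
        this.fileName = fileName;
        this.content = content;
    }

    // Same signature as Writable.write: serialize the fields in a fixed order.
    public void write(DataOutput out) throws IOException {
        out.writeUTF(fileName);
        out.writeInt(content.length);  // length prefix for the byte array
        out.write(content);
    }

    // Same signature as Writable.readFields: read the fields back in the same order.
    public void readFields(DataInput in) throws IOException {
        fileName = in.readUTF();
        content = new byte[in.readInt()];
        in.readFully(content);
    }

    // Serialize and deserialize in memory to check that the two methods agree.
    static PdfRecord roundTrip(PdfRecord original) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            original.write(new DataOutputStream(buffer));
            PdfRecord copy = new PdfRecord();
            copy.readFields(new DataInputStream(
                    new ByteArrayInputStream(buffer.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Placeholder bytes standing in for a real PDF file.
        byte[] fakePdf = "%PDF-1.4".getBytes(StandardCharsets.US_ASCII);
        PdfRecord copy = roundTrip(new PdfRecord("report.pdf", fakePdf));
        System.out.println(copy.fileName + " " + copy.content.length);
    }
}
```

Inside a mapper, the content bytes could then be handed to a PDF library. With PDFBox 2.x, for example, PDDocument.load(byte[]) followed by new PDFTextStripper().getText(document) should yield the text, though the loading call differs between PDFBox versions. If no metadata is needed, Hadoop's built-in BytesWritable may be enough as the value type, with the file name as a Text key.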
