How to use AWS Lambda to convert PDF files to .txt with Python

Posted by 只愿长相守 on 2021-01-29 09:57:46

Question


I need to automate the conversion of many PDF files to text files using AWS Lambda in Python 3.7.

I've successfully converted PDF files using poppler/pdftotext, tika, and PyPDF2 on my own machine. However, tika either times out or needs a Java instance running on a host machine, which I'm not sure how to set up. pdftotext needs poppler, and all the solutions for running that on Lambda seem to be outdated, or I'm just not familiar enough with binaries to make sense of them. PyPDF2 seems the most promising, but testing throws an error.

The code and error I'm getting for PyPDF2 is as follows:

pdf_file = open(s3.Bucket(my_bucket).download_file('test.pdf','test.pdf'),'rb')

  "errorMessage": "[Errno 30] Read-only file system: 'test.pdf.3F925aC8'",
  "errorType": "OSError",



And if I try to reference it directly:
pdf_file = open('https://s3.amazonaws.com/' + my_bucket + '/test.pdf', 'rb')

  "errorMessage": "[Errno 2] No such file or directory: 'https://s3.amazonaws.com/my_bucket/test.pdf'",
  "errorType": "FileNotFoundError",

Answer 1:


AWS Lambda only allows you to write to the /tmp directory, so you should download the file there.
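A minimal sketch of that suggestion, assuming the boto3 client API. The helper name download_to_tmp and its tmp_dir parameter are made up for illustration; the bucket and key are the placeholders from the question. Note that download_file returns None, so its result cannot be passed to open(); open the local path afterwards instead.

```python
import os


def download_to_tmp(s3_client, bucket, key, tmp_dir="/tmp"):
    """Download s3://bucket/key into tmp_dir and return the local path."""
    # Keep only the file name so nested keys such as "docs/a.pdf" also work.
    local_path = os.path.join(tmp_dir, os.path.basename(key))
    # download_file writes to local_path and returns None.
    s3_client.download_file(bucket, key, local_path)
    return local_path


def lambda_handler(event, context):
    # Imported here so the helper above can be exercised without boto3.
    import boto3

    # "my_bucket" and "test.pdf" are the placeholder names from the question.
    path = download_to_tmp(boto3.client("s3"), "my_bucket", "test.pdf")
    with open(path, "rb") as pdf_file:
        return {"bytes_read": len(pdf_file.read())}
```

The file handle opened from the returned path can then be handed to PyPDF2 the same way as on a local machine.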




Answer 2:


As the error states, you are trying to write to a read-only filesystem. The download_file method tries to save the file to the relative path 'test.pdf', which resolves to a location outside the writable /tmp directory, so the write fails. Try using download_fileobj together with an in-memory buffer (e.g. io.BytesIO) instead, then feed that stream to PyPDF2.

Example:

import io
[...]

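pdf_stream = io.BytesIO()  # PDF data is binary, so use a bytes buffer rather than StringIO
object.download_fileobj(pdf_stream)  # object is the s3.Object for the PDF
pdf_obj = PdfFileReader(pdf_stream)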

[...]
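A fuller sketch of the in-memory approach, under some assumptions: it uses the boto3 client API and the PyPDF2 1.x names (PdfFileReader, getPage, extractText), which newer PyPDF2 releases renamed. The function names and the output key "test.txt" are hypothetical. The buffer is an io.BytesIO because PDF data is binary.

```python
import io


def fetch_pdf_stream(s3_client, bucket, key):
    """Download s3://bucket/key into memory and return a rewound BytesIO."""
    stream = io.BytesIO()
    s3_client.download_fileobj(bucket, key, stream)
    stream.seek(0)  # rewind so readers start at the beginning
    return stream


def extract_text(pdf_stream):
    """Join the text of every page, using the PyPDF2 1.x API."""
    # Assumes PyPDF2 is bundled with the function (deployment package or layer).
    from PyPDF2 import PdfFileReader

    reader = PdfFileReader(pdf_stream)
    pages = (reader.getPage(i).extractText() for i in range(reader.getNumPages()))
    return "\n".join(pages)


def lambda_handler(event, context):
    import boto3

    s3 = boto3.client("s3")
    # "my_bucket" and "test.pdf" are the placeholder names from the question.
    text = extract_text(fetch_pdf_stream(s3, "my_bucket", "test.pdf"))
    # Hypothetical output key: store the text next to the source PDF.
    s3.put_object(Bucket="my_bucket", Key="test.txt", Body=text.encode("utf-8"))
    return {"characters": len(text)}
```

Nothing touches the local filesystem here, so the read-only error cannot occur. The trade-off is that the whole PDF is held in the function's memory.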


Source: https://stackoverflow.com/questions/56794361/how-to-use-aws-lambda-to-convert-pdf-files-to-txt-with-python
