Question
#!/usr/bin/env python3
import sys
import jpype
import jpype.imports
jpype.addClassPath(sys.argv[1])
jpype.startJVM(convertStrings=False)
import org.apache.pdfbox.tools as tools
tools.ExtractText.main(['-startPage', '1', sys.argv[2], sys.argv[3]])
I use the Python code above to call pdfbox, invoking it like this:
$ ./main.py pdfbox-app-2.0.20.jar in.pdf output.txt
But it is slow to load the jar file every time I want to convert a PDF file. Could anybody provide Flask code for a RESTful service, so that pdfbox is loaded only once and can then be called to extract text from PDFs?
PS. This tutorial does not answer my question:
https://flask.palletsprojects.com/en/1.1.x/patterns/fileuploads/
For example, it imports send_from_directory, which is rather far from a complete solution. What I need is an example program that takes input from the REST interface, saves the file somewhere, calls the Java code, and then sends the resulting file back. So a single example showing all three steps is needed.
Answer 1:
You can create a POST route in Flask that receives the uploaded PDF file, processes it with pdfbox, and returns whatever you need to the user (either the text content or the text file itself). I haven't tested this code; it's just an example to convey how to handle it. I hope it helps!
"""
Untested sketch with possible mistakes, just to get the idea...
"""
import gzip
import os
from io import BytesIO

import jpype
import jpype.imports
from flask import Flask, make_response, request

UPLOAD_FOLDER = '/path/to/the/uploads'
ALLOWED_EXTENSIONS = {'pdf'}

app = Flask(__name__)
app.config['UPLOAD_FOLDER'] = UPLOAD_FOLDER

# Start the JVM once, when the module is loaded, so every request reuses it
jpype.addClassPath('pdfbox-app-2.0.20.jar')
jpype.startJVM(convertStrings=False)
# Java packages can only be imported after the JVM has started
import org.apache.pdfbox.tools as tools


def allowed_file(filename):
    """Helper function to figure out if file is a-ok"""
    return '.' in filename and \
        filename.rsplit('.', 1)[1].lower() in ALLOWED_EXTENSIONS


@app.route('/', methods=['POST'])
def index():
    """Route to upload and process PDF files"""
    uploaded_file_name = os.path.join(app.config['UPLOAD_FOLDER'], 'upload.pdf')
    converted_file = os.path.join(app.config['UPLOAD_FOLDER'], 'output.txt')

    # Get the file from the multipart form upload (e.g. curl -F file=@in.pdf);
    # uploaded files arrive in request.files, not request.form
    uploaded = request.files.get('file')
    if uploaded is None or not allowed_file(uploaded.filename):
        return 'Expected a PDF in the "file" form field.', 400

    # Save the content of the file to local disk (binary-safe)
    uploaded.save(uploaded_file_name)

    # Call the Java tool
    try:
        tools.ExtractText.main(['-startPage', '1', uploaded_file_name, converted_file])
    except Exception:
        return 'Error processing file.', 500

    # Do extra post-processing if needed

    # Serve back the extracted text, gzip-compressed
    with open(converted_file, 'rb') as f:
        text = f.read()
    gzip_buffer = BytesIO()
    with gzip.GzipFile(mode='wb', fileobj=gzip_buffer) as gzip_file:
        gzip_file.write(text)
    response = make_response(gzip_buffer.getvalue())
    response.headers.set('Content-Type', 'text/plain')
    response.headers.set('Content-Encoding', 'gzip')
    response.headers.set('Content-Length', str(len(response.get_data())))
    response.headers.set('Content-Disposition', 'attachment', filename='output.txt')
    return response
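One thing to watch in the route above: every request writes to the same two file names, so two uploads arriving at the same time could overwrite each other. A minimal sketch of the save / convert / read-back steps using a per-request temporary directory follows. The names convert_upload, extract and fake_extract are hypothetical and not part of Flask, JPype or pdfbox; fake_extract only stands in for the Java call so that the sketch runs without a JVM.

```python
import os
import tempfile


def convert_upload(pdf_bytes, extract):
    """Save the upload, run the extractor, and return the extracted bytes.

    `extract` is a hypothetical callable taking (input_path, output_path);
    in the service it would wrap the tools.ExtractText.main call.
    """
    # A fresh directory per request keeps concurrent uploads apart
    with tempfile.TemporaryDirectory() as workdir:
        in_path = os.path.join(workdir, 'upload.pdf')
        out_path = os.path.join(workdir, 'output.txt')
        # Step 1: save the uploaded bytes to disk
        with open(in_path, 'wb') as f:
            f.write(pdf_bytes)
        # Step 2: run the conversion (the Java call in the real service)
        extract(in_path, out_path)
        # Step 3: read the result back before the directory is deleted
        with open(out_path, 'rb') as f:
            return f.read()


def fake_extract(in_path, out_path):
    """Stand-in for pdfbox so the sketch runs without a JVM."""
    with open(in_path, 'rb') as src, open(out_path, 'wb') as dst:
        dst.write(b'extracted ' + str(len(src.read())).encode() + b' bytes')


if __name__ == '__main__':
    print(convert_upload(b'%PDF-1.4 dummy', fake_extract))
```

Inside the Flask route, pdf_bytes could come from request.files['file'].read(), and extract could be something like lambda src, dst: tools.ExtractText.main(['-startPage', '1', src, dst]). That wiring is untested as well.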
Source: https://stackoverflow.com/questions/62629131/flask-restful-service-for-pdfbox