Flask RESTful service for PDFBox

Submitted by 心已入冬 on 2020-07-20 06:54:12

Question


#!/usr/bin/env python3

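import sys

import jpype
import jpype.imports

# Usage: ./main.py <pdfbox-jar> <input.pdf> <output.txt>
jpype.addClassPath(sys.argv[1])
jpype.startJVM(convertStrings=False)
# Java packages are imported after the JVM has been started
import org.apache.pdfbox.tools as tools
tools.ExtractText.main(['-startPage', '1', sys.argv[2], sys.argv[3]])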

I use the Python script above to call PDFBox, like this:

$ ./main.py pdfbox-app-2.0.20.jar in.pdf output.txt

But loading the jar file every time I want to convert a PDF file is slow. Could anybody provide Flask code for a RESTful service, so that PDFBox is loaded only once and can then be called to extract text from PDFs?

PS. This tutorial does not solve my problem.

https://flask.palletsprojects.com/en/1.1.x/patterns/fileuploads/

For example, it imports send_from_directory, which is some way from a complete solution. What I need is an example program that takes input from the REST interface, saves the file somewhere, calls the Java code, and then sends the resulting file back. So I need a single example that shows all three steps.


Answer 1:


You can create a POST route in Flask that receives the uploaded PDF file, processes it with PDFBox, and returns whatever you need to the user (either the text content or the text file itself). I didn't test this code; it's just an example to show how to handle it. Hope it helps!

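"""
Sketch of a Flask service that keeps the JVM (and PDFBox) loaded between
requests. Not tested, may contain mistakes, just to get the idea...
"""
import gzip
from io import BytesIO

import jpype
import jpype.imports
from flask import Flask, make_response
from flask import request

UPLOAD_FOLDER = '/path/to/the/uploads'
# The service only converts PDFs, so nothing else is accepted
ALLOWED_EXTENSIONS = {'pdf'}

app = Flask(__name__)
app.config['UPLOAD_FOLDER'] = UPLOAD_FOLDER

# Start the JVM once, when the module is loaded, so every request reuses it
jpype.addClassPath('pdfbox-app-2.0.20.jar')
jpype.startJVM(convertStrings=False)

# Java packages are imported after the JVM has been started,
# as in the question's script
import org.apache.pdfbox.tools as tools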


def allowed_file(filename):
    """ Helper function to figure out if file is a-ok"""
    return '.' in filename and \
        filename.rsplit('.', 1)[1].lower() in ALLOWED_EXTENSIONS


@app.route('/', methods=['POST'])
def index():
    """ Route to upload and process PDF files """
    uploaded_file_name = 'upload.pdf'
    converted_file = 'output.txt'

    # Uploaded files arrive in request.files, not request.form
    # (with curl: -F 'file=@in.pdf')
    uploaded = request.files.get('file')
    if uploaded is None or not allowed_file(uploaded.filename or ''):
        return make_response('Expected a PDF in the "file" field.', 400)

    # Save the raw bytes of the upload to local disk
    uploaded.save(uploaded_file_name)

    # Call PDFBox inside the JVM that is already running
    try:
        tools.ExtractText.main(['-startPage', '1', uploaded_file_name, converted_file])
    except Exception as exc:
        print('Error processing file:', exc)
        return make_response('Error processing file.', 500)

    # Read the extracted text (not just its file name) and gzip it
    with open(converted_file, 'rb') as f:
        text_bytes = f.read()
    gzip_buffer = BytesIO()
    with gzip.GzipFile(mode='wb', fileobj=gzip_buffer) as gzip_file:
        gzip_file.write(text_bytes)

    # Serve the compressed text back as a download
    response = make_response(gzip_buffer.getvalue())
    response.headers.set('Content-Type', 'text/plain')
    response.headers.set('Content-Encoding', 'gzip')
    response.headers.set('Content-Length', len(response.get_data()))
    response.headers.set('Content-Disposition', 'attachment', filename=converted_file)
    return response
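
The route above writes every upload to the same upload.pdf and output.txt, so two requests handled at the same time could overwrite each other's files. One way around that is to give each request its own scratch directory. The following is a minimal sketch using only the standard library; the helper name request_workdir is hypothetical and is not part of Flask or PDFBox.

```python
import os
import shutil
import tempfile
from contextlib import contextmanager


@contextmanager
def request_workdir(prefix='pdfbox-'):
    """Yield (pdf_path, txt_path) inside a fresh temp dir, removed on exit.

    Hypothetical helper: the file names mirror the ones used in the route.
    """
    workdir = tempfile.mkdtemp(prefix=prefix)
    try:
        yield (os.path.join(workdir, 'upload.pdf'),
               os.path.join(workdir, 'output.txt'))
    finally:
        # Remove the directory and everything that was written into it
        shutil.rmtree(workdir, ignore_errors=True)
```

Inside the route, `with request_workdir() as (pdf_path, txt_path):` could replace the two fixed file names; the extracted text would need to be read into memory before the block exits, because the files are deleted at that point.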

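The route also gzips its response unconditionally, which only works for clients that can decode gzip. A possible refinement is to compress only when the request's Accept-Encoding header offers it. This is a sketch under that assumption; encode_body is a hypothetical helper, and q-values in the header are ignored for brevity.

```python
import gzip


def encode_body(data, accept_encoding):
    """Return (body, content_encoding), compressing only if gzip is offered.

    accept_encoding is the raw Accept-Encoding header value, or None.
    Simplification: q-values such as "gzip;q=0" are not honoured.
    """
    offered = [token.split(';')[0].strip().lower()
               for token in (accept_encoding or '').split(',')]
    if 'gzip' in offered:
        return gzip.compress(data), 'gzip'
    # No gzip support announced: send the bytes unchanged
    return data, None
```

In the route, the header value would come from `request.headers.get('Accept-Encoding')`, and the Content-Encoding header would be set only when the second return value is not None.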

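To try such a service from Python without extra dependencies, the upload can be encoded as multipart/form-data by hand. This is a rough client sketch, assuming the service reads the PDF from a form field named file (as the route above does) and may answer with a gzip-encoded body; build_multipart and extract_text are hypothetical names, and the URL is whatever address the Flask app is served on.

```python
import gzip
import urllib.request
import uuid


def build_multipart(field, filename, data, content_type='application/pdf'):
    """Encode one file as a multipart/form-data body.

    Returns (body_bytes, content_type_header_value).
    """
    boundary = uuid.uuid4().hex
    head = (
        f'--{boundary}\r\n'
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f'Content-Type: {content_type}\r\n'
        '\r\n'
    ).encode('utf-8')
    tail = f'\r\n--{boundary}--\r\n'.encode('utf-8')
    return head + data + tail, f'multipart/form-data; boundary={boundary}'


def extract_text(url, pdf_path):
    """POST a PDF to the service and return the response body as bytes."""
    with open(pdf_path, 'rb') as f:
        body, header = build_multipart('file', 'in.pdf', f.read())
    req = urllib.request.Request(url, data=body, method='POST',
                                 headers={'Content-Type': header})
    with urllib.request.urlopen(req) as resp:
        payload = resp.read()
        # urllib does not undo Content-Encoding on its own
        if resp.headers.get('Content-Encoding') == 'gzip':
            payload = gzip.decompress(payload)
    return payload
```

With the service running, `extract_text(service_url, 'in.pdf')` would return the extracted text as bytes, which covers the upload, convert, and send-back round trip the question asks about.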
Source: https://stackoverflow.com/questions/62629131/flask-restful-service-for-pdfbox
