Extract images from PDF file with JavaScript

感情败类 2020-12-16 03:30

I want to write JavaScript code to extract all image files from a PDF file, perhaps getting them as JPG or some other image format. There is already some JavaScript code for

1 Answer
  • 2020-12-16 04:12

    If you open a page with pdf.js, for example

    PDFJS.getDocument({url: <pdf file>}).then(function (doc) {
        doc.getPage(1).then(function (page) {
            window.page = page;
        })
    })
    

    you can then use getOperatorList to search for paintJpegXObject objects and grab the resources.

    window.objs = []
    page.getOperatorList().then(function (ops) {
        for (var i=0; i < ops.fnArray.length; i++) {
            if (ops.fnArray[i] == PDFJS.OPS.paintJpegXObject) {
                window.objs.push(ops.argsArray[i][0])
            }
        }
    })
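
The scan over the operator list can also be pulled out into a small helper, which makes it easier to reuse per page and to test without a PDF. This is only a sketch: `collectObjectIds` and `targetOp` are hypothetical names, and it assumes the operator list has the parallel `fnArray` / `argsArray` shape used in the loop above.

```javascript
// Sketch: return the first argument of every operator whose code equals
// `targetOp`. Assumes `ops.fnArray` and `ops.argsArray` are parallel arrays,
// as in the loop above. `collectObjectIds` is a hypothetical helper name.
function collectObjectIds(ops, targetOp) {
    var ids = [];
    for (var i = 0; i < ops.fnArray.length; i++) {
        // Only matching entries are indexed, so other argsArray slots may be empty.
        if (ops.fnArray[i] === targetOp) {
            ids.push(ops.argsArray[i][0]);
        }
    }
    return ids;
}

// Possible usage with the page opened earlier (not run here):
// page.getOperatorList().then(function (ops) {
//     window.objs = collectObjectIds(ops, PDFJS.OPS.paintJpegXObject);
// });
```

Passing the operator code in as a parameter keeps the helper independent of the pdf.js global, so the same function could be pointed at a different paint operator if needed.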
    

    Now window.objs will have a list of the resources from that page that you need to fetch.

    console.log(window.objs.map(function (a) { return page.objs.get(a) }))
    

    should print to the console a bunch of <img /> objects with data-uri src= attributes. These can be directly inserted into the page, or you can do more scripting to get at the raw data.
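
For the "raw data" route, one option is to decode the data URI held in each image's `src` back into bytes. The following is a minimal sketch, assuming the `src` is a base64 data URI as described above; `dataUriToBytes` is a hypothetical helper name, and the `Buffer` fallback is only there for environments without a global `atob`.

```javascript
// Sketch: decode a base64 data URI (e.g. "data:image/jpeg;base64,...") into
// its MIME type and raw bytes. `dataUriToBytes` is a hypothetical helper.
function dataUriToBytes(dataUri) {
    var comma = dataUri.indexOf(',');
    if (dataUri.slice(0, 5) !== 'data:' || comma === -1) {
        throw new Error('not a data URI');
    }
    var header = dataUri.slice(5, comma);   // e.g. "image/jpeg;base64"
    if (!/;base64$/i.test(header)) {
        throw new Error('only base64 data URIs are handled in this sketch');
    }
    var payload = dataUri.slice(comma + 1);
    // atob yields a "binary string": one character per byte.
    var binary = typeof atob === 'function'
        ? atob(payload)
        : Buffer.from(payload, 'base64').toString('binary');
    var bytes = new Uint8Array(binary.length);
    for (var i = 0; i < binary.length; i++) {
        bytes[i] = binary.charCodeAt(i);
    }
    return { mimeType: header.split(';')[0], bytes: bytes };
}
```

With an `<img>` element `img` from the console output, `dataUriToBytes(img.src).bytes` would give a `Uint8Array` that could, for instance, be wrapped in a `Blob` for download.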

    It only works for embedded JPEG objects, but it's a start!
