I want to write JavaScript code to extract all image files from a PDF file, perhaps getting them as JPG or some other image format. There is already some JavaScript code for
If you open a page with pdf.js, for example
PDFJS.getDocument({url: <pdf file>}).then(function (doc) {
  // Load the first page and keep a reference to it for the steps below
  doc.getPage(1).then(function (page) {
    window.page = page;
  });
});
you can then use getOperatorList to search for paintJpegXObject operations and grab the resources.
window.objs = [];
page.getOperatorList().then(function (ops) {
  for (var i = 0; i < ops.fnArray.length; i++) {
    if (ops.fnArray[i] === PDFJS.OPS.paintJpegXObject) {
      window.objs.push(ops.argsArray[i][0]);
    }
  }
});
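The scan in the loop above can also be written as a small helper that does not depend on the pdf.js globals, which makes it easy to try out in isolation. This is only a sketch: collectOperatorArgs is a hypothetical name (not part of pdf.js), and the opcodes 1 and 2 in the sample data are made-up stand-ins, not the real value of PDFJS.OPS.paintJpegXObject.

```javascript
// Hypothetical helper (not part of pdf.js): collect the first argument of
// every operation whose opcode equals `opCode`.
// `ops` is assumed to carry parallel `fnArray` / `argsArray` arrays,
// as in the loop above.
function collectOperatorArgs(ops, opCode) {
  var names = [];
  for (var i = 0; i < ops.fnArray.length; i++) {
    if (ops.fnArray[i] === opCode) {
      names.push(ops.argsArray[i][0]);
    }
  }
  return names;
}

// Stand-in operator list; opcodes 1 and 2 are placeholders, not pdf.js values.
var fakeOps = {
  fnArray: [1, 2, 1, 2],
  argsArray: [["a"], ["img_p0_1", 10, 20], ["b"], ["img_p0_2", 30, 40]]
};
var found = collectOperatorArgs(fakeOps, 2);
```

In the browser you would call it from inside the getOperatorList callback, passing PDFJS.OPS.paintJpegXObject as the opcode.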
Now window.objs will have a list of the resources from that page that you need to fetch.
console.log(window.objs.map(function (a) { return page.objs.get(a); }))
should print to the console a bunch of <img /> objects with data-URI src attributes. These can be directly inserted into the page, or you can do more scripting to get at the raw data.
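As one way of getting at the raw data, a data URI can be decoded back into bytes. The following is a rough sketch, assuming the src is a base64-encoded data URI (for example data:image/jpeg;base64,...) and that atob is available, as it is in browsers and recent Node versions. dataUriToBytes is a hypothetical helper name, not a pdf.js function.

```javascript
// Hypothetical helper: decode a base64 data URI into a Uint8Array.
// Assumes the form "data:<mime>;base64,<payload>" and a global atob().
function dataUriToBytes(dataUri) {
  var comma = dataUri.indexOf(",");
  if (comma === -1 || !/;base64$/i.test(dataUri.slice(0, comma))) {
    throw new Error("expected a base64 data URI");
  }
  var binary = atob(dataUri.slice(comma + 1));
  var bytes = new Uint8Array(binary.length);
  for (var i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return bytes;
}

// Tiny sample payload: the four bytes 0xFF 0xD8 0xFF 0xE0,
// which is how a JPEG stream typically begins.
var sampleBytes = dataUriToBytes("data:image/jpeg;base64,/9j/4A==");
```

With a real image element you would pass its src attribute instead of the sample string, then wrap the bytes in a Blob or write them out as a .jpg file.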
It only works for embedded JPEG objects, but it's a start!