I need to compare the contents of two nearly identical files and highlight the differing portions in the corresponding PDF file. I am using PDFBox. Please help me at least with
You can do the same thing with a shell script on Linux. The script wraps 3 components: ImageMagick's compare command, the pdftk utility, and Ghostscript.
It's rather easy to translate this into a .bat batch file for DOS/Windows...
Here are the building blocks:
Use these commands to split the multi-page PDF files into multiple single-page PDFs:
pdftk first.pdf burst output somewhere/firstpdf_page_%03d.pdf
pdftk 2nd.pdf burst output somewhere/2ndpdf_page_%03d.pdf
Use this command to create a "diff" PDF page for each of the pages:
compare \
-verbose \
-debug coder -log "%u %m:%l %e" \
somewhere/firstpdf_page_001.pdf \
somewhere/2ndpdf_page_001.pdf \
-compose src \
somewhereelse/diff_page_001.pdf
Note that compare is part of ImageMagick. For PDF processing it needs Ghostscript as a 'delegate', because it cannot process PDF natively.
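The compare call above handles a single page. A minimal sketch of how it could be repeated for every burst page follows. The directory and file names are the ones used in the examples above; the helper names diff_name and diff_all_pages are hypothetical, and the verbose/debug options are left out for brevity.

```shell
#!/bin/sh
# Sketch only: diff_name and diff_all_pages are illustrative helpers,
# not part of pdftk or ImageMagick.

# Map a first-file page name to its diff page name, e.g.
# somewhere/firstpdf_page_007.pdf -> diff_page_007.pdf
diff_name() {
    base=${1##*/}
    printf 'diff_page_%s\n' "${base#firstpdf_page_}"
}

# Run compare on every pair of single-page PDFs produced by "pdftk ... burst".
diff_all_pages() {
    for first in somewhere/firstpdf_page_*.pdf; do
        # An unmatched glob stays literal; skip it.
        [ -e "$first" ] || continue
        page=$(diff_name "$first")
        second="somewhere/2ndpdf_page_${page#diff_page_}"
        # Pages that exist only in the first file have nothing to diff against.
        if [ ! -e "$second" ]; then
            echo "no counterpart for $first" >&2
            continue
        fi
        compare "$first" "$second" -compose src "somewhereelse/$page"
    done
}
```

Calling diff_all_pages after the two burst commands should leave one diff_page_NNN.pdf per page in somewhereelse/. The script deliberately avoids "set -e", since compare is expected to return a non-zero exit status when the two pages differ.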
Now you can again concatenate your "diff" PDF pages with pdftk:
pdftk \
somewhereelse/diff_page_*.pdf \
cat \
output somewhereelse/diff_allpages.pdf
Ghostscript automatically inserts metadata (such as the current date and time) into its PDF output. As a result, two otherwise identical PDF pages get different checksums, so MD5-hash-based comparisons of the PDF files themselves do not work well.
If you want to automatically detect all pages that are purely white (meaning there are no visible differences between your input pages), you can convert them to a metadata-free bitmap format using the bmp256 output device. You can do that for the original PDFs (first.pdf and 2nd.pdf), or for the diff-PDF pages:
gs \
-o diff_page_001.bmp \
-r72 \
-g595x842 \
-sDEVICE=bmp256 \
diff_page_001.pdf
md5sum diff_page_001.bmp
For reference, create an all-white BMP page and compute its MD5 sum like this:
gs \
-o reference-white-page.bmp \
-r72 \
-g595x842 \
-sDEVICE=bmp256 \
-c "showpage quit"
md5sum reference-white-page.bmp
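The two checksums can then be compared in the script instead of by eye. This is a minimal sketch assuming md5sum from GNU coreutils; md5_of and is_blank_page are hypothetical helper names, not part of Ghostscript or ImageMagick.

```shell
#!/bin/sh
# Sketch only: md5_of and is_blank_page are illustrative helpers.

# Print only the hash column of md5sum's output.
md5_of() {
    md5sum "$1" | cut -d' ' -f1
}

# Return 0 if the page bitmap ($1) is byte-identical to the reference
# bitmap ($2), 1 if it differs, and 2 if either file is missing.
is_blank_page() {
    [ -f "$1" ] && [ -f "$2" ] || return 2
    [ "$(md5_of "$1")" = "$(md5_of "$2")" ]
}
```

For example, "is_blank_page diff_page_001.bmp reference-white-page.bmp" succeeds only when the two bitmaps are byte-identical, so both must be rendered with the same -r and -g settings for the check to be meaningful.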