How could I search the contents of PDF files in a directory/subdirectory? I am looking for some command line tools. It seems that grep can't search PDF files.
My current version of pdfgrep (1.3.0) allows the following:
pdfgrep -HiR 'pattern' /path
Running pdfgrep --help explains the options used here: -H prints the file name for each match, -i makes the match case-insensitive, and -R searches directories recursively. It works well on my Ubuntu.
There is an open-source common resource grep tool, crgrep, which searches within PDF files but also other resources such as content nested in archives, database tables, image metadata, POM file dependencies, and web resources, including combinations of these and recursive search.
The full description under the Files tab pretty much covers what the tool supports.
I developed crgrep as an open-source tool.
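For reference, a minimal sketch of what an invocation might look like; I have not verified the exact options, so check crgrep --help (or the documentation under the Files tab) for the real flags:

# assumed grep-like syntax with -r for recursive search; verify with crgrep --help
crgrep -r 'pattern' .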
There is another utility called ripgrep-all, which is based on ripgrep.
It can handle more than just PDF documents, like Office documents and movies, and the author claims it is faster than pdfgrep.
The following commands search recursively from the current directory; the second one limits the search to PDF files only:
rga 'pattern' .
rga --type pdf 'pattern' .
Recoll is a fantastic full-text GUI search application for Unix/Linux that supports dozens of different formats, including PDF. It can even pass the exact page number and search term of a query to the document viewer and thus allows you to jump to the result right from its GUI.
Recoll also comes with a viable command-line interface and a web-browser interface.
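If you want to stay in the terminal, here is a minimal sketch of a command-line query, assuming the index has already been built with recollindex (recollq is the query tool shipped with Recoll):

# build or update the Recoll index, then query it from the command line
recollindex
recollq 'pattern'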
I made this small, quick-and-dirty script; note that it writes a converted text file next to each PDF and leaves it there unless you uncomment the rm line. Have fun with it.
function pdfsearch()
{
    # find all PDFs under the current directory and grep their extracted text
    find . -iname '*.pdf' | while IFS= read -r filename
    do
        #echo -e "\033[34;1m// === PDF Document:\033[33;1m $filename\033[0m"
        # convert the PDF to a plain-text file named "$filename." and search it
        pdftotext -q -enc ASCII7 "$filename" "$filename."
        grep -s -H --color=always -i "$1" "$filename."
        # uncomment to remove the converted text file afterwards:
        # rm -f "$filename."
    done
}
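A quick usage note, assuming you save the function in a file (the name pdfsearch.sh below is only an example): source it and call it with a search pattern.

# load the function and search all PDFs below the current directory (case-insensitive)
source pdfsearch.sh
pdfsearch 'pattern'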
I like @sjr's answer; however, I prefer xargs to -exec. I find xargs more versatile: for example, with -P we can take advantage of multiple CPUs when it makes sense to do so.
find . -name '*.pdf' | xargs -P 5 -I {} sh -c 'pdftotext "{}" - | grep --with-filename --label="{}" --color "pattern"'
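For comparison, a roughly equivalent -exec form (a sketch; see @sjr's answer for the exact command) processes the files one at a time instead of in parallel:

# one pdftotext | grep pipeline per file, run sequentially; {} substitution inside the quoted command works with GNU find
find . -name '*.pdf' -exec sh -c 'pdftotext "{}" - | grep --with-filename --label="{}" --color "pattern"' \;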