How to search contents of multiple pdf files?

误落风尘 2020-11-30 15:58

How could I search the contents of PDF files in a directory/subdirectory? I am looking for some command line tools. It seems that grep can't search PDF files.

13 Answers
  • 2020-11-30 16:35

    My current version of pdfgrep (1.3.0) allows the following:

    pdfgrep -HiR 'pattern' /path
    

    According to pdfgrep --help:

    • -H: Print the file name for each match.
    • -i: Ignore case distinctions.
    • -R: Search directories recursively.

    It works well on my Ubuntu.
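
    To see where in each document a match occurs, pdfgrep also has -n (--page-number), which prefixes every match with its page number. Below is a small sketch that wraps this in a function; the name pdfgrep_pages is made up for this example and is not part of pdfgrep:

```shell
# Hypothetical wrapper: recursive, case-insensitive search that also prints page numbers.
# Usage: pdfgrep_pages PATTERN [DIR]
pdfgrep_pages()
{
    # -n: prefix each match with its page number; -H, -i, -R as described above.
    # DIR defaults to the current directory when omitted.
    pdfgrep -n -HiR "$1" "${2:-.}"
}
```

    For example, pdfgrep_pages 'pattern' /path searches /path, and pdfgrep_pages 'pattern' searches the current directory.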

  • 2020-11-30 16:37

    crgrep (common resource grep) is an open source tool that searches inside PDF files as well as other resources: content nested in archives, database tables, image metadata, POM file dependencies, and web resources. These can be combined, including in recursive searches.

    The full description under the Files tab pretty much covers what the tool supports.

    I developed crgrep as an open source tool.

  • 2020-11-30 16:39

    There is another utility called ripgrep-all, which is based on ripgrep.

    It can handle more than just PDF documents, like Office documents and movies, and the author claims it is faster than pdfgrep.

    The first command below searches the current directory recursively; the second limits the search to PDF files only:

    rga 'pattern' .
    rga --type pdf 'pattern' .
    
  • 2020-11-30 16:45

    Recoll is a fantastic full-text GUI search application for Unix/Linux that supports dozens of different formats, including PDF. It can even pass the exact page number and search term of a query to the document viewer and thus allows you to jump to the result right from its GUI.

    Recoll also comes with a viable command-line interface and a web-browser interface.

  • 2020-11-30 16:47

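    I made this small script. Be aware that it is destructive: for every PDF it writes the extracted text to a file with the same name plus a trailing dot (for example file.pdf.), overwriting any existing file of that name, and it leaves those files behind unless the rm line is enabled. Have fun with it.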

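    function pdfsearch()
    {
        # Search every PDF below the current directory for the pattern given in $1.
        # -print0 / read -d '' keep file names with spaces or newlines intact (bash).
        find . -iname '*.pdf' -print0 | while IFS= read -r -d '' filename
        do
            #echo -e "\033[34;1m// === PDF Document:\033[33;1m $filename\033[0m"
            # Extract the text into "<name>.pdf." (trailing dot) next to the PDF.
            pdftotext -q -enc ASCII7 "$filename" "$filename."
            # Quote the pattern so spaces and glob characters are passed through unchanged.
            grep -s -H --color=always -i -- "$1" "$filename."
            # Uncomment to delete the extracted text afterwards:
            # rm -f "$filename."
        done
    }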
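
    A non-destructive variant, as a rough sketch: pdftotext writes to standard output when the output file is given as -, so no text files have to be created next to the PDFs. The function name pdfsearch_stream is made up for this example, and --label assumes GNU grep. File names containing newlines are not handled here.

```shell
# Hypothetical helper: search PDFs below a directory without writing any files.
# Usage: pdfsearch_stream PATTERN [DIR]
pdfsearch_stream()
{
    # DIR defaults to the current directory when omitted.
    find "${2:-.}" -iname '*.pdf' | while IFS= read -r filename
    do
        # "-" sends the extracted text to stdout instead of a file.
        # --label (GNU grep) makes -H print the PDF name instead of "(standard input)".
        pdftotext -q "$filename" - | grep -H -i --label="$filename" -- "$1"
    done
}
```

    Each match is printed as the PDF path, a colon, and the matching line, which is the same shape plain grep -H produces for text files.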
    
  • 2020-11-30 16:47

    I like @sjr's answer, but I prefer xargs to -exec because I find xargs more versatile. For example, with -P we can take advantage of multiple CPUs when it makes sense to do so.

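    # grep must run inside each xargs job; otherwise --label stays a literal "{}" and the file name is lost
    find . -name '*.pdf' -print0 | xargs -0 -P 5 -I % sh -c 'pdftotext "$1" - | grep --with-filename --label="$1" --color "pattern"' sh %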
    