How to search contents of multiple pdf files?

How can I search the contents of PDF files in a directory and its subdirectories? I am looking for command-line tools. It seems that grep can't search PDF files.

13 answers

爱一瞬间的悲伤

2020-11-30 16:35
My current version of pdfgrep (1.3.0) supports the following:
```
pdfgrep -HiR 'pattern' /path
```
According to `pdfgrep --help`:
- `-H`: Print the file name for each match.
- `-i`: Ignore case distinctions.
- `-R`: Search directories recursively.

It works well on my Ubuntu system.

醉酒成梦

2020-11-30 16:37

There is an open-source "common resource grep" tool, crgrep, which searches within PDF files as well as other resources such as content nested in archives, database tables, image metadata, POM file dependencies, and web resources. It can also combine these, including recursive search.

The full description under the Files tab pretty much covers what the tool supports.

I developed crgrep as an open-source tool.

野的像风

2020-11-30 16:39
There is another utility called ripgrep-all, which is based on ripgrep.

It handles more than just PDF documents, for example Office documents and movies, and the author claims it is faster than pdfgrep.

The first command below searches the current directory recursively; the second limits the search to PDF files:
```
rga 'pattern' .
rga --type pdf 'pattern' .
```

佛祖请我去吃肉

2020-11-30 16:45

Recoll is a fantastic full-text GUI search application for Unix/Linux that supports dozens of different formats, including PDF. It can even pass the exact page number and search term of a query to the document viewer and thus allows you to jump to the result right from its GUI.

Recoll also comes with a viable command-line interface and a web-browser interface.

你的背包

2020-11-30 16:47

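I made this small script. Be aware that it is destructive: it writes the extracted text to a file named `<name>.pdf.` next to each PDF and leaves it there unless you uncomment the `rm` line. Have fun with it.

```
pdfsearch() {
    # Search every PDF below the current directory for the pattern in $1 (case-insensitive).
    find . -iname '*.pdf' | while IFS= read -r filename
    do
        #echo -e "\033[34;1m// === PDF Document:\033[33;1m $filename\033[0m"
        # Extract the text to "<name>.pdf." next to the PDF, then grep that file.
        pdftotext -q -enc ASCII7 "$filename" "$filename."
        grep -s -H --color=always -i -- "$1" "$filename."
        # Uncomment to delete the extracted text afterwards:
        # rm -f "$filename."
    done
}
```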
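
The leftover text files can be avoided by having `pdftotext` write to standard output (passing `-` as the output file) and piping that straight into `grep`. The following is only a sketch of that idea: the function name `pdfsearch_stream` is made up for illustration, `--label` assumes GNU grep, and file names containing newlines are not handled.

```shell
# Hypothetical helper (not part of the script above): search all PDFs
# below the current directory without writing extracted text to disk.
# Usage: pdfsearch_stream 'pattern'
pdfsearch_stream() {
    find . -iname '*.pdf' | while IFS= read -r filename
    do
        # "-" makes pdftotext write the text to stdout.
        # --label (GNU grep) names the stdin stream, so -H prints the
        # PDF's path in front of every matching line.
        pdftotext -q "$filename" - |
            grep -H -i --label="$filename" -- "$1"
    done
}
```

Calling `pdfsearch_stream 'pattern'` from the top of a directory tree prints one `path:matching line` entry per hit, and nothing is left behind next to the PDFs.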

清歌不尽

2020-11-30 16:47
I like @sjr's answer; however, I prefer xargs to -exec because I find xargs more versatile. For example, with -P we can take advantage of multiple CPUs when it makes sense to do so. Running grep inside the same `sh -c` call as pdftotext lets `--label` carry the name of each file:
```
find . -name '*.pdf' -print0 | xargs -0 -P 5 -I % sh -c 'pdftotext "$1" - | grep --with-filename --label="$1" --color "pattern"' sh %
```