How to search the contents of multiple PDF files?

后端未结


How can I search the contents of PDF files in a directory and its subdirectories? I am looking for command-line tools. It seems that grep can't search PDF files.


13 answers

情话喂你

2020-11-30 16:48
First convert all your PDF files to text files:
```
for file in *.pdf;do pdftotext "$file"; done
```
Then use grep as normal. This is especially useful when you have multiple queries and a lot of PDF files, because the conversion happens only once and every later grep is fast.
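As a rough sketch of that workflow, assuming `pdftotext` is installed and that the text files should sit next to their PDFs: the function name `pdf_text_cache`, the directory `./docs`, and the search term `invoice` are placeholders of mine, not part of the original answer.

```shell
# Convert every PDF in a directory to a sibling .txt file, skipping
# PDFs whose text file already exists and is not older than the PDF.
pdf_text_cache() {
    dir=${1:-.}
    for pdf in "$dir"/*.pdf; do
        [ -e "$pdf" ] || continue          # the glob matched nothing
        txt=${pdf%.pdf}.txt
        if [ ! -e "$txt" ] || [ "$pdf" -nt "$txt" ]; then
            pdftotext "$pdf" "$txt"
        fi
    done
}

# Usage (hypothetical directory and pattern):
#   pdf_text_cache ./docs
#   grep -il 'invoice' ./docs/*.txt    # list the text files that match
```

Skipping files that are already converted keeps repeated runs cheap, which is the point of converting up front.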
無奈伤痛

2020-11-30 16:51

You need a tool such as pdftotext to first convert your PDF to a text file, and then search inside the text. (The conversion will probably lose some information or symbols.)

If you are using a programming language, there are probably PDF libraries written for this purpose, e.g. http://search.cpan.org/dist/CAM-PDF/ for Perl.

[愿得一人]

2020-11-30 16:52
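If you want to see file names with pdftotext, use the following command. The `\|pdf` alternative in the grep pattern keeps the echoed file names in the output (it also keeps any text line that happens to contain "pdf"):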
```
find . -name '*.pdf' -exec echo {} \; -exec pdftotext {} - \; | grep "pattern\|pdf" 
```
走了就别回头了

2020-11-30 16:53
There is pdfgrep, which does exactly what its name suggests.
```
pdfgrep -R 'a pattern to search recursively from path' /some/path
```
I've used it for simple searches and it worked fine.

(There are packages in Debian, Ubuntu and Fedora.)

Since version 1.3.0 pdfgrep supports recursive search. This version is available in Ubuntu since Ubuntu 12.10 (Quantal).
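A small wrapper along these lines may be convenient for everyday use. It is only a sketch: the function name `search_pdfs` is made up, and it assumes a pdfgrep recent enough to support recursive search.

```shell
# Search PDFs under a directory, case-insensitively, printing the file
# name and page number of each match.
# Usage: search_pdfs PATTERN [DIR]
search_pdfs() {
    pattern=$1
    dir=${2:-.}
    # -R: recurse, -i: ignore case, -n: page number, -H: file name
    pdfgrep -R -i -n -H -- "$pattern" "$dir"
}

# Usage (hypothetical pattern and path):
#   search_pdfs 'quarterly report' /some/path
```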
你的背包

2020-11-30 16:53

Try using 'acroread' in a simple script like the one above.

忘了有多久

2020-11-30 16:56
Your distribution should provide a utility called pdftotext:
```
find /path -name '*.pdf' -exec sh -c 'pdftotext "$1" - | grep --with-filename --label="$1" --color "your pattern"' sh {} \;
```
The "-" is necessary to have pdftotext output to stdout, not to files. The --with-filename and --label= options will put the file name in the output of grep. The optional --color flag is nice and tells grep to output using colors on the terminal.

(In Ubuntu, pdftotext is provided by the package xpdf-utils or poppler-utils.)

This method, using pdftotext and grep, has an advantage over pdfgrep if you want to use features of GNU grep that pdfgrep doesn't support. Note: pdfgrep-1.3.x supports the -C option for printing lines of context.
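To make it easy to pass arbitrary GNU grep options, the same idea can be wrapped in a function. This is a sketch under a few assumptions: `grep_pdfs` is a name I made up, GNU grep is required for `--label`, and file names must not contain newlines.

```shell
# Run grep with arbitrary options over the text of each PDF under a
# directory. Usage: grep_pdfs DIR [grep options...] PATTERN
grep_pdfs() {
    dir=$1
    shift
    # Note: reading names line by line breaks on file names with newlines.
    find "$dir" -name '*.pdf' -print | while IFS= read -r pdf; do
        # "-" sends the extracted text to stdout; --label names the
        # source PDF in grep's output instead of "(standard input)".
        pdftotext "$pdf" - </dev/null |
            grep --with-filename --label="$pdf" "$@"
    done
}

# Usage (hypothetical path and pattern): whole-word match, 2 lines of
# trailing context.
#   grep_pdfs /path -w -A 2 'your pattern'
```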