I'm trying to read a .doc file into a database so that I can index its contents. Is there an easy way for PHP on Linux to read .doc files? Failing that is it possible to
There seems to be a library for accessing Word documents, but I'm not sure how to access it from PHP. I think the best solution would be to call its wv command from PHP.
After days of searching, here is my best solution: http://wvware.sourceforge.net/
Install the package:
sudo apt-get install wv
Use it in PHP:
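# Swap only the trailing .doc extension, not a ".doc" elsewhere in the path
$output = preg_replace('/\.doc$/i', '.txt', $filename);
# Escape both paths so spaces and shell metacharacters cannot break the command
shell_exec('/usr/bin/wvText ' . escapeshellarg($filename) . ' ' . escapeshellarg($output));
$text = file_get_contents($output);
# Convert to UTF-8 if needed (assumes ISO-8859-1 input, as utf8_encode did;
# utf8_encode itself is deprecated as of PHP 8.2)
if (!mb_detect_encoding($text, 'UTF-8', true))
{
    $text = mb_convert_encoding($text, 'UTF-8', 'ISO-8859-1');
}
unlink($output);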
I found a unoconv package in Ubuntu. It converts between all formats supported by OpenOffice. You should be able to use exec in PHP to run this utility.
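For illustration, calling unoconv through exec might look roughly like the sketch below. The function names are made up for this example, and it assumes unoconv is installed and on the PATH; -f picks the output format and --stdout prints the converted text instead of writing a file.

```php
<?php
// Hypothetical helper: build the unoconv command for a plain-text conversion.
// escapeshellarg() keeps spaces and shell metacharacters in the path safe.
function buildUnoconvCommand(string $path): string
{
    return 'unoconv -f txt --stdout ' . escapeshellarg($path);
}

// Hypothetical helper: run the conversion and return the text,
// or null if unoconv exits with a non-zero status.
function docToTextWithUnoconv(string $path): ?string
{
    exec(buildUnoconvCommand($path), $lines, $exitCode);
    return $exitCode === 0 ? implode("\n", $lines) : null;
}
```

Something like `$text = docToTextWithUnoconv('/path/to/file.doc');` would then give you the text to insert into the database, with null signalling that the conversion failed.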
phpLiveDocx is a Zend Framework component and can read and write DOC and RTF files in PHP on Linux, Windows and Mac. Furthermore, you can use it to generate PDF files and even merge data from PHP into template files created with MS Word or Open Office!
See the project web site at:
http://www.phplivedocx.org
You can use antiword or AbiWord to pull the text out and feed it to your favorite full-text indexer. AbiWord is probably more effective for your purposes because it can convert into RTF, PDF and other formats (yes, it's a GUI word processor, but it also supports command-line usage).
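A minimal sketch of the antiword route, assuming antiword is installed and on the PATH. It writes the extracted text to stdout, so no temporary file is needed. The function names here are hypothetical, chosen only for this example.

```php
<?php
// Hypothetical helper: build the antiword command line.
// antiword prints the document text to stdout by default.
function buildAntiwordCommand(string $path): string
{
    return 'antiword ' . escapeshellarg($path);
}

// Hypothetical helper: return the extracted text ready for the indexer,
// or null when the file is unreadable or antiword produces no output.
function extractTextForIndexing(string $path): ?string
{
    if (!is_readable($path)) {
        return null;
    }
    // Discard antiword's error messages; only the text on stdout is wanted.
    $text = shell_exec(buildAntiwordCommand($path) . ' 2>/dev/null');
    return (is_string($text) && $text !== '') ? $text : null;
}
```

The returned string can be handed straight to whatever full-text indexer you use; a null result means the file should be skipped or logged.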
DOC files are stored in a binary format, and so far there haven't been any classes written purely in PHP for dealing with them.
RTF files are much easier to parse: being mostly text, you can just open them up with fopen and read the contents.
I would suggest using RTF if you can, as there really is not a sound solution for DOC files yet.
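To show how little is needed for simple RTF, here is a deliberately naive sketch that strips control words and group braces with regular expressions. The function name is made up, and this is not a real RTF parser: font tables, embedded objects, hex and Unicode escapes are not handled, so treat it as a starting point only.

```php
<?php
// Hypothetical helper: very rough RTF-to-text conversion.
function rtfToPlainText(string $rtf): string
{
    // Turn paragraph and line breaks into newlines before stripping the rest.
    $text = preg_replace('/\\\\(par|line)\b ?/', "\n", $rtf);
    // Remove remaining control words such as \rtf1, \ansi, \b or \fs24,
    // together with the single space that terminates them.
    $text = preg_replace('/\\\\[a-zA-Z]+-?\d* ?/', '', $text);
    // Drop the braces that delimit RTF groups.
    $text = str_replace(['{', '}'], '', $text);
    return trim($text);
}
```

For a file on disk you would read it first, for example `rtfToPlainText(file_get_contents($path))`, and then index the result.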