Question
The title may be a little confusing. The simplest method is to judge by the file extension, like this:
// `is` is the InputStream
if (filePath.endsWith("doc")) {
    WordExtractor ex = new WordExtractor(is);
    text = ex.getText();
    ex.close();
} else if (filePath.endsWith("docx")) {
    XWPFDocument doc = new XWPFDocument(is);
    XWPFWordExtractor extractor = new XWPFWordExtractor(doc);
    text = extractor.getText();
    extractor.close();
}
This works in most cases. But I have found certain files whose extension is doc that are essentially docx files: if you open one with WinRAR, you will find xml files inside. As is well known, a docx file is a zip file consisting of xml files.
I believe this problem cannot be rare, but I have not found any information about it. Obviously, judging by the extension is not an appropriate way to decide whether to read a file as doc or docx.
In my case, I have to read a lot of files. I will even read doc or docx files inside a compressed archive: zip, 7z or even rar. Hence, I have to read the content from an InputStream instead of a File or anything else. So "how to know whether a file is .docx or .doc format from Apache POI", with its ZipInputStream approach, is totally unsuitable for my case.
What is the best way to judge whether a file is doc or docx? I want a solution that reads the content from a file which may be either doc or docx, not one that merely tells me which of the two it is. Apparently, ZipInputStream is not a good method for my case, and I believe it is not an appropriate method for others either. Why should I have to judge whether the file is doc or docx by an exception?
Answer 1:
Using the current stable apache poi version 3.17, you may use FileMagic. But internally this will of course also have a look into the files.
Example:
import java.io.InputStream;
import java.io.FileInputStream;
import java.io.BufferedInputStream;

import org.apache.poi.poifs.filesystem.FileMagic;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

public class ReadWord {

    // The stream must support mark/reset (e.g. a BufferedInputStream),
    // because FileMagic.valueOf peeks at the first bytes and then rewinds.
    static String read(InputStream is) throws Exception {
        FileMagic magic = FileMagic.valueOf(is); // detect once, reuse below
        System.out.println(magic);
        String text = "";
        if (magic == FileMagic.OLE2) {
            WordExtractor ex = new WordExtractor(is);
            text = ex.getText();
            ex.close();
        } else if (magic == FileMagic.OOXML) {
            XWPFDocument doc = new XWPFDocument(is);
            XWPFWordExtractor extractor = new XWPFWordExtractor(doc);
            text = extractor.getText();
            extractor.close();
        }
        return text;
    }

    public static void main(String[] args) throws Exception {
        InputStream is = new BufferedInputStream(new FileInputStream("ExampleOLE.doc")); // really a binary OLE2 Word file
        System.out.println(read(is));
        is.close();

        is = new BufferedInputStream(new FileInputStream("ExampleOOXML.doc")); // an OOXML Word file named *.doc
        System.out.println(read(is));
        is.close();

        is = new BufferedInputStream(new FileInputStream("ExampleOOXML.docx")); // really an OOXML Word file
        System.out.println(read(is));
        is.close();
    }
}
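To illustrate what "having a look into the files" can mean, here is a rough JDK-only sketch of header sniffing on a stream. It is not POI's implementation. The class WordFormatSniffer and its methods are hypothetical names made up for this example. It relies on two well-known signatures: OLE2 compound files start with the bytes D0 CF 11 E0 A1 B1 1A E1, and ZIP archives (which is what a docx is) start with "PK\003\004". A match only identifies the container, so a ZIP hit could still be an xlsx or a plain zip, and an OLE2 hit could be an xls.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper for illustration; not part of Apache POI.
class WordFormatSniffer {

    enum Kind { OLE2, ZIP, UNKNOWN }

    // First 8 bytes of every OLE2 compound file (binary .doc, .xls, .ppt).
    private static final int[] OLE2_SIGNATURE = {0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1};
    // Local file header signature "PK\003\004" at the start of a ZIP archive.
    private static final int[] ZIP_SIGNATURE = {0x50, 0x4B, 0x03, 0x04};

    /** Wraps the stream if it cannot be rewound, so the header can be re-read later. */
    static InputStream markable(InputStream in) {
        return in.markSupported() ? in : new BufferedInputStream(in);
    }

    /** Peeks at up to 8 header bytes, then rewinds the stream to where it was. */
    static Kind sniff(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IOException("stream must support mark/reset");
        }
        byte[] header = new byte[8];
        int total = 0;
        in.mark(header.length);
        try {
            // A single read() may return fewer bytes than requested, so loop.
            while (total < header.length) {
                int n = in.read(header, total, header.length - total);
                if (n < 0) break;
                total += n;
            }
        } finally {
            in.reset();
        }
        if (startsWith(header, total, OLE2_SIGNATURE)) return Kind.OLE2;
        if (startsWith(header, total, ZIP_SIGNATURE)) return Kind.ZIP;
        return Kind.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, int length, int[] signature) {
        if (length < signature.length) return false;
        for (int i = 0; i < signature.length; i++) {
            if ((data[i] & 0xFF) != signature[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        // Made-up bytes that merely begin with the ZIP signature.
        byte[] fakeDocx = {0x50, 0x4B, 0x03, 0x04, 0x14, 0x00, 0x06, 0x00};
        InputStream in = markable(new ByteArrayInputStream(fakeDocx));
        System.out.println(sniff(in)); // prints ZIP
        System.out.println(in.read()); // prints 80 ('P'): the stream was rewound
    }
}
```

Because the stream is rewound after the check, the same stream can then be handed to WordExtractor or XWPFDocument. For an entry read from a ZipInputStream, which does not support mark/reset itself, wrapping it with markable(...) first should be enough, though closing the wrapper would also close the underlying archive stream.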
Answer 2:
try {
    new ZipFile(new File("/Users/giang/Documents/a.doc"));
    System.out.println("this file is .docx");
} catch (ZipException e) {
    System.out.println("this file is not .docx");
    e.printStackTrace();
}
Source: https://stackoverflow.com/questions/47483011/how-to-judge-if-the-file-is-doc-or-docx-in-poi