Java: Apache POI: Can I get clean text from MS Word (.doc) files?

野趣味 2020-12-31 10:24

The strings I'm (programmatically) getting from MS Word files when using Apache POI are not the same text I see when I open the files in MS Word.

When usi

3 answers
  • 2020-12-31 10:55

    There are two options, one provided directly in Apache POI, the other via Apache Tika (which uses Apache POI internally).

    The first option is to use WordExtractor, but pass each paragraph through stripFields(String) before using it. That removes the text-based fields embedded in the text, such as the HYPERLINK entries you've seen. Your code would become:

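    // NPOIFSFileSystem is the POI 3.x class; newer POI releases use POIFSFileSystem instead
    NPOIFSFileSystem fs = new NPOIFSFileSystem(file);
    WordExtractor extractor = new WordExtractor(fs.getRoot());

    for (String rawText : extractor.getParagraphText()) {
        // stripFields is a static method; it drops field codes such as HYPERLINK
        String text = WordExtractor.stripFields(rawText);
        System.out.println(text);
    }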
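
To make it clearer what "stripping fields" means: in the binary .doc text stream a field is bracketed by three control characters, 0x13 (field begin), 0x14 (separator between the field code and its visible result) and 0x15 (field end). The class below is only a rough, stdlib-only sketch of that idea, not POI's own implementation; the names FieldStripSketch and stripFieldCodes are made up for illustration. In real code you would still call WordExtractor.stripFields.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustration only: a simplified stand-in, not the code POI ships.
class FieldStripSketch {
    // Control characters that delimit a field in Word's binary text stream
    private static final char FIELD_BEGIN = '\u0013';
    private static final char FIELD_SEPARATOR = '\u0014';
    private static final char FIELD_END = '\u0015';

    // Drops field codes (e.g. HYPERLINK "...") and keeps the visible field result.
    static String stripFieldCodes(String raw) {
        StringBuilder out = new StringBuilder(raw.length());
        // One entry per open field: true while we are still inside its code part
        Deque<Boolean> inCode = new ArrayDeque<>();
        int codeDepth = 0; // number of open fields that are still in their code part

        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (c == FIELD_BEGIN) {
                inCode.push(true);
                codeDepth++;
            } else if (c == FIELD_SEPARATOR) {
                // The innermost field switches from code to visible result
                if (!inCode.isEmpty() && inCode.peek()) {
                    inCode.pop();
                    inCode.push(false);
                    codeDepth--;
                }
            } else if (c == FIELD_END) {
                // Close the innermost field; it may never have had a result part
                if (!inCode.isEmpty() && inCode.pop()) {
                    codeDepth--;
                }
            } else if (codeDepth == 0) {
                // Ordinary text, or the visible result of a field
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String raw = "See \u0013 HYPERLINK \"http://example.com\" \u0014the site\u0015 now";
        System.out.println(stripFieldCodes(raw)); // prints "See the site now"
    }
}
```

This is why the raw paragraph text shows things like HYPERLINK "..." that Word itself never displays: the field code sits in the same character stream as the text, and only the part after the separator is meant to be shown.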
    

    The other option is to use Apache Tika. Tika provides text extraction and metadata for a wide variety of file types, so the same code will work for .doc, .docx, .pdf and many others too. To get clean, plain text from your Word document (you can also get XHTML if you'd rather), you'd do something like:

    TikaConfig tika = TikaConfig.getDefaultConfig();
    TikaInputStream stream = TikaInputStream.get(file);
    ContentHandler handler = new BodyContentHandler();
    Metadata metadata = new Metadata();
    tika.getParser().parse(stream, handler, metadata, new ParseContext());
    String text = handler.toString();
    
  • 2020-12-31 11:01

    Try this, it works for me and is purely a POI solution. Note that XWPFDocument, which I use here, reads .docx files (Word 2007 and later). For the older binary .doc format (Word 97-2003) you will have to use the HWPFDocument counterpart instead.

    // Open the file
    InputStream inputstream = new FileInputStream(m_filepath);
    // Load it as an XWPF (.docx) document
    XWPFDocument adoc = new XWPFDocument(inputstream);

    // Get the full text
    String aString = new XWPFWordExtractor(adoc).getText();
    

    If you only want certain parts, don't go through the text extractor; call getParagraphText() directly on each paragraph, like this:

    for (XWPFParagraph p : adoc.getParagraphs()) 
    { 
        System.out.println(p.getParagraphText());
    }
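
If you can't trust the file extension, one way to decide between the HWPF and XWPF classes is to look at the first bytes of the file: binary .doc files are OLE2 containers and start with D0 CF 11 E0 A1 B1 1A E1, while .docx files are ZIP archives and start with "PK" followed by the bytes 3 and 4. Below is a small stdlib-only sketch of that check; WordFormatSniffer and its method names are hypothetical, not part of POI. Keep in mind that these signatures only identify the container, so an .xls file or an arbitrary ZIP would match as well.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper: guesses the container format from the leading bytes.
class WordFormatSniffer {
    enum Container { OLE2, ZIP, UNKNOWN }

    // Signature of an OLE2 compound file (.doc, but also .xls, .ppt, ...)
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // Local file header signature of a ZIP archive (.docx, but also any other ZIP)
    private static final byte[] ZIP_MAGIC = { 'P', 'K', 3, 4 };

    static Container sniff(byte[] head) {
        if (startsWith(head, OLE2_MAGIC)) {
            return Container.OLE2;
        }
        if (startsWith(head, ZIP_MAGIC)) {
            return Container.ZIP;
        }
        return Container.UNKNOWN;
    }

    static Container sniff(Path file) throws IOException {
        byte[] head = new byte[OLE2_MAGIC.length];
        int total = 0;
        try (InputStream in = Files.newInputStream(file)) {
            // read() may return fewer bytes than requested, so loop until full or EOF
            while (total < head.length) {
                int n = in.read(head, total, head.length - total);
                if (n < 0) {
                    break;
                }
                total += n;
            }
        }
        return sniff(Arrays.copyOf(head, total));
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        // OLE2 suggests HWPFDocument, ZIP suggests XWPFDocument
        System.out.println(sniff(Paths.get(args[0])));
    }
}
```

With a result of OLE2 you would go on to HWPFDocument / WordExtractor, and with ZIP to XWPFDocument / XWPFWordExtractor.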
    
  • 2020-12-31 11:09

    This class can read both .doc and .docx files in Java. For this I'm using tika-app-1.2.jar:

    /*
     * This class is used to read .doc and .docx files
     * 
     * @author Developer
     *
     */
    
    import java.io.ByteArrayOutputStream;
    import java.io.File;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.io.OutputStreamWriter;
    import java.net.URL; 
    import org.apache.tika.detect.DefaultDetector;
    import org.apache.tika.detect.Detector;
    import org.apache.tika.io.TikaInputStream;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.Parser;
    import org.apache.tika.sax.BodyContentHandler;
    import org.xml.sax.ContentHandler;
    
    class TextExtractor { 
        private OutputStream outputstream;
        private ParseContext context;
        private Detector detector;
        private Parser parser;
        private Metadata metadata;
        private String extractedText;
    
        public TextExtractor() {
            context = new ParseContext();
            detector = new DefaultDetector();
            parser = new AutoDetectParser(detector);
            context.set(Parser.class, parser);
            outputstream = new ByteArrayOutputStream();
            metadata = new Metadata();
        }
    
        public void process(String filename) throws Exception {
            URL url;
            File file = new File(filename);
            if (file.isFile()) {
                url = file.toURI().toURL();
            } else {
                url = new URL(filename);
            }
            InputStream input = TikaInputStream.get(url, metadata);
            ContentHandler handler = new BodyContentHandler(outputstream);
            parser.parse(input, handler, metadata, context); 
            input.close();
        }
    
        public void getString() {
            //Get the text into a String object
            extractedText = outputstream.toString();
            //Do whatever you want with this String object.
            System.out.println(extractedText);
        }
    
        public static void main(String args[]) throws Exception {
            if (args.length == 1) {
                TextExtractor textExtractor = new TextExtractor();
                textExtractor.process(args[0]);
                textExtractor.getString();
            } else { 
                throw new Exception("Usage: java TextExtractor <file-or-URL>");
            }
        }
    }
    

    To compile (on Windows, use ; instead of : as the classpath separator):

    javac -cp ".:tika-app-1.2.jar" TextExtractor.java
    

    To run:

    java -cp ".:tika-app-1.2.jar" TextExtractor SomeWordDocument.doc
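
One caveat with the class above: the text goes through a ByteArrayOutputStream, and both the BodyContentHandler(OutputStream) constructor and outputstream.toString() rely on the platform default encoding. If that encoding cannot represent some characters in the document, they are lost on the way. The stdlib-only sketch below (no Tika involved; CharsetRoundTrip and its methods are made-up names) shows the effect, and why collecting the text through a Writer such as StringWriter avoids it. With Tika that would presumably mean handing a StringWriter to BodyContentHandler(Writer), or using the no-argument BodyContentHandler and calling toString() on it as in the first answer.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.StringWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustration only: compares text collected via bytes with text collected via a Writer.
class CharsetRoundTrip {
    // Encodes the text into bytes with the given charset, then decodes it again
    static String viaBytes(String text, Charset charset) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(bytes, charset)) {
            writer.write(text);
        }
        return new String(bytes.toByteArray(), charset);
    }

    // Keeps the text as characters the whole time, so no encoding is involved
    static String viaWriter(String text) {
        StringWriter writer = new StringWriter();
        writer.write(text);
        return writer.toString();
    }

    public static void main(String[] args) throws IOException {
        String text = "na\u00efve \u6587\u672c"; // Latin text with an accent, plus two CJK characters
        System.out.println(viaBytes(text, StandardCharsets.US_ASCII).equals(text)); // false
        System.out.println(viaBytes(text, StandardCharsets.UTF_8).equals(text));    // true
        System.out.println(viaWriter(text).equals(text));                           // true
    }
}
```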
    