How do I create a corpus of *.docx files with tm?

余生颓废 submitted on 2019-12-19 04:14:56

Question


I have a mixed filetype collection of MS Word documents. Some files are *.doc and some are *.docx. I'm learning to use tm and I've (more or less*) successfully created a corpus composed of the *.doc files using this:

ex_eng <- Corpus(DirSource('~/R/expertise/corpus/english'), 
                 readerControl=list(reader=readDOC, 
                                    language='en_CA',
                                    load=TRUE));

This command does not handle *.docx files. I assume that I need a different reader. From this article, I understand that I could write my own (given a good understanding of the .docx format, which I do not currently have).

The readDOC reader uses antiword to parse *.doc files. Is there a similar application that will parse *.docx files?

Or better still, is there already a standard way of creating a corpus of *.docx files using tm?


* more or less, because although the files go in and are readable, I get this warning for every document: In readLines(y, encoding = x$Encoding) : incomplete final line found on 'path/to/a/file.doc'


Answer 1:


.docx files are zipped XML files. If you execute this:

> uzfil <- unzip(file.choose())

And then pick a .docx file in your directory, you get:

> str(uzfil)
 chr [1:13] "./[Content_Types].xml" "./_rels/.rels" "./word/_rels/document.xml.rels" ...
> uzfil
 [1] "./[Content_Types].xml"          "./_rels/.rels"                  "./word/_rels/document.xml.rels"
 [4] "./word/document.xml"            "./word/theme/theme1.xml"        "./docProps/thumbnail.jpeg"     
 [7] "./word/settings.xml"            "./word/webSettings.xml"         "./word/styles.xml"             
[10] "./docProps/core.xml"            "./word/numbering.xml"           "./word/fontTable.xml"          
[13] "./docProps/app.xml"       

This will also silently unpack all of those files to your working directory. The "./word/document.xml" file has the words you are looking for, so you can probably read them with one of the XML tools in the XML package. I'm guessing you would do something along the lines of:

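 library(XML)
 # unzip() has already extracted the files, so parse the extracted document.xml directly
 doc_path <- uzfil[grepl("word/document\\.xml$", uzfil)]
 xtext <- xmlTreeParse(doc_path, useInternalNodes = TRUE)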

Actually, you will probably want to extract the archive into a temporary directory rather than your working directory (unzip() takes an exdir argument for this) and use that path when referring to "./word/document.xml".

You may want to use the further steps provided by @GaborGrothendieck in this answer: How to extract xml data from a CrossRef using R?
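As a rough sketch of how the steps above might fit together, the following unzips only word/document.xml into a temporary directory and collects the text of each paragraph with XPath. The helper names document_xml_to_text and read_docx_text are made up for illustration, and the code assumes that visible text lives in w:t elements inside w:p paragraphs; it has not been checked against documents with text boxes, footnotes, or tracked changes.

```r
library(XML)

# WordprocessingML namespace used by word/document.xml
w_ns <- c(w = "http://schemas.openxmlformats.org/wordprocessingml/2006/main")

# Hypothetical helper: one string per <w:p> paragraph in a document.xml file
document_xml_to_text <- function(xml_path) {
  doc <- xmlParse(xml_path)
  paragraphs <- getNodeSet(doc, "//w:p", namespaces = w_ns)
  vapply(paragraphs, function(p) {
    # Concatenate the text runs (<w:t>) found inside this paragraph
    runs <- xpathSApply(p, ".//w:t", xmlValue, namespaces = w_ns)
    paste(unlist(runs), collapse = "")
  }, character(1))
}

# Hypothetical helper: extract document.xml from a .docx into a temporary
# directory and return the document text as a single string
read_docx_text <- function(docx_path) {
  out_dir <- tempfile("docx_")
  dir.create(out_dir)
  on.exit(unlink(out_dir, recursive = TRUE))
  unzip(docx_path, files = "word/document.xml", exdir = out_dir)
  xml_path <- file.path(out_dir, "word", "document.xml")
  paste(document_xml_to_text(xml_path), collapse = "\n")
}
```

Applied to each *.docx file in a directory, this gives a character vector of document texts, which could then be handed to tm with Corpus(VectorSource(texts)) instead of a DirSource.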




Answer 2:


I ended up using docx2txt to convert the .docx files to text. Then I created a corpus from them like this:

ex_eng <- Corpus(DirSource('~/R/expertise/corpus/english'), 
                 readerControl=list(reader=readPlain, 
                                    language='en_CA',
                                    load=TRUE));

I figure I could probably hack the readDOC reader so that it would use docx2txt or antiword as needed, but this works.
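For anyone who wants to run the conversion step from R rather than from the shell, here is a minimal sketch. The helper name convert_docx_dir is hypothetical, and both the command name "docx2txt" and the "input file, then output file" argument order are assumptions about how the tool is installed on your system; check its usage message and adjust before relying on this.

```r
# Hypothetical helper: convert every .docx in `dir` to a .txt file next to it.
# `converter` is the command to run; passing the input path followed by the
# output path is an assumption about the installed docx2txt script.
convert_docx_dir <- function(dir, converter = "docx2txt") {
  dir <- normalizePath(dir)  # expand "~" so quoted paths still resolve
  docx_files <- list.files(dir, pattern = "\\.docx$", full.names = TRUE,
                           ignore.case = TRUE)
  txt_files <- sub("\\.docx$", ".txt", docx_files, ignore.case = TRUE)
  for (i in seq_along(docx_files)) {
    status <- system2(converter,
                      c(shQuote(docx_files[i]), shQuote(txt_files[i])))
    if (status != 0) warning("conversion failed for ", docx_files[i])
  }
  invisible(txt_files)
}
```

If the converted .txt files end up in the same directory as the original Word files, DirSource accepts a pattern argument (for example pattern = "\\.txt$") so that readPlain only sees the text files.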



Source: https://stackoverflow.com/questions/16065952/how-do-i-create-a-corpus-of-docx-files-with-tm
