import a complex .docx file as .xml and extract the chapters

问题

--update-- maybe someone can assume another possibility so split a .docxdocument into its chapters, importing .docxto R

first of all, I want to give thanks for this awesome forum. I found several solutions for my upcoming issues. But this time I haven't found anything...

However, I have a complex .docx document, containing an index, formatted to .xml.

library(XML)
xmlfile <- xmlParse("C:/Users/Documents/stihl.xml", options = HUGE)

topxml <- xmlRoot(xmlfile)

topxml <- xmlSApply(topxml, function(x) xmlSApply(x, xmlValue))
xml_df <- data.frame(t(topxml), row.names = NULL, node)

And other possibilities to read an XML file. My .docx document has an index and now I want to extract the several index content. As an .docx example

1. Introduction  
   This is an introduction importing XML by R.  
2. UserGuide  
   Userguides are often helpful.  
2.1 Style  
   The style should be always the same.  
2.2 Language  
   I hope my Language is readable, because I'm contacting you from Germany.

As a result it would be nice to receive the content of the seperated chapters, for example stored in a vector.

result 
[1]This is an introduction importing XML by R.
[2]Userguides are often helpful.
[3]The style should be always the same.
[4]I hope my Language is readable, because I'm contacting you from Germany.

Maybe there are other possibilities keeping the structure but I mentioned an XML import containing the tree structure as the easiest way.

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<?mso-application progid="Word.Document"?>
<pkg:package xmlns:pkg="http://schemas.microsoft.com/office/2006/xmlPackage">

  <pkg:part 
    pkg:name="/_rels/.rels" 
    pkg:contentType="application/vnd.openxmlformats-package.relationships+xml" 
    pkg:padding="512">
    <pkg:xmlData>
       <Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
          <Relationship 
           Id="rId3" 
           Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/extended-properties" 
           Target="docProps/app.xml"/>
          <Relationship 
           Id="rId2" 
           Type="http://schemas.openxmlformats.org/package/2006/relationships/metadata/core-properties" 
           Target="docProps/core.xml"/>
          <Relationship Id="rId1" 
           Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument" 
           Target="word/document.xml"/>
       </Relationships>
    </pkg:xmlData>
  </pkg:part>

  <pkg:part 
   #serveral relationships
  </pkg:part>

  <pkg:part 
    pkg:name="/word/document.xml" 
    pkg:contentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml">
     <pkg:xmlData>

      <w:document mc:Ignorable="w14 w15 wp14" 




    xmlns:wpc:http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas
   xmlns:mc:http://schemas.openxmlformats.org/markup-compatibility/2006
   xmlns:o:urn:schemas-microsoft-com:office:office
    xmlns:r:http://schemas.openxmlformats.org/officeDocument/2006/relationships
    xmlns:m:http://schemas.openxmlformats.org/officeDocument/2006/math
    xmlns:v:urn:schemas-microsoft-com:vml
    xmlns:wp14:http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing
    xmlns:wp:http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing
    xmlns:w10:urn:schemas-microsoft-com:office:word
    xmlns:w:http://schemas.openxmlformats.org/wordprocessingml/2006/main
    xmlns:w14:http://schemas.microsoft.com/office/word/2010/wordml
   xmlns:w15:http://schemas.microsoft.com/office/word/2012/wordml
    xmlns:wpg:http://schemas.microsoft.com/office/word/2010/wordprocessingGroup
    xmlns:wpi:http://schemas.microsoft.com/office/word/2010/wordprocessingInk
    xmlns:wne:http://schemas.microsoft.com/office/word/2006/wordml
   xmlns:wps:http://schemas.microsoft.com/office/word/2010/wordprocessingShape

         <w:body>

           <w:p> ...
          </w:p>

          <w:p w14:paraId="5BB64FEF" w14:textId="77777777" w:rsidR="005A3789" w:rsidRDefault="005A3789" w:rsidP="005A3789">
           <w:pPr>
            <w:pStyle w:val="Inhaltsverzeichnisberschrift"/>
           </w:pPr>
           <w:r>
            <w:lastRenderedPageBreak/>
            <w:t>Inhaltsverzeichnis</w:t>
           </w:r>
          </w:p>

'Inhaltsverzeichnis' is the titel of my index. The path is package -> 3.part -> xmldata -> document -> body -> p

The information is stored here for example

<w:p w14:paraId="15ECF978" w14:textId="77777777" w:rsidR="009B5500" w:rsidRDefault="005A3789">
<w:pPr>
<w:pStyle w:val="Verzeichnis1"/>
<w:rPr>
<w:rFonts w:eastAsiaTheme="minorEastAsia"/>
<w:b w:val="0"/>
<w:noProof/>
<w:color w:val="auto"/>
<w:lang w:eastAsia="de-DE"/>
</w:rPr>
</w:pPr>
<w:r>
<w:rPr>
<w:b w:val="0"/>
</w:rPr>
<w:fldChar w:fldCharType="begin"/>
</w:r>
<w:r>
<w:instrText xml:space="preserve"> TOC \o "1-4" \h \z \u 
</w:instrText>
</w:r>
<w:r>
<w:rPr>
<w:b w:val="0"/>
</w:rPr>
<w:fldChar w:fldCharType="separate"/>
</w:r>
<w:hyperlink w:anchor="_Toc474825312" w:history="1">
<w:r w:rsidR="009B5500" w:rsidRPr="009D0220"><w:rPr>
<w:rStyle w:val="Hyperlink"/>
<w:noProof/>
</w:rPr>
                  **<w:t>1</w:t>**
</w:r>
<w:r w:rsidR="009B5500"><w:rPr><w:rFonts w:eastAsiaTheme="minorEastAsia"/>
<w:b w:val="0"/>
<w:noProof/>
<w:color w:val="auto"/>
<w:lang w:eastAsia="de-DE"/>
</w:rPr><w:tab/>
</w:r>
<w:r w:rsidR="009B5500" w:rsidRPr="009D0220">
<w:rPr>
<w:rStyle w:val="Hyperlink"/>
<w:noProof/>
</w:rPr>
                  **<w:t>Management Summary</w:t>**
</w:r>
<w:r w:rsidR="009B5500">
<w:rPr>
<w:noProof/>
<w:webHidden/>
</w:rPr>
<w:tab/>
</w:r>
<w:r w:rsidR="009B5500">
<w:rPr>
<w:noProof/>
<w:webHidden/>
</w:rPr><w:fldChar w:fldCharType="begin"/>
</w:r>
<w:r w:rsidR="009B5500">
<w:rPr>
<w:noProof/>
<w:webHidden/>
</w:rPr>
<w:instrText xml:space="preserve"> PAGEREF _Toc474825312 \h </w:instrText>
</w:r>
<w:r w:rsidR="009B5500">
<w:rPr>
<w:noProof/>
<w:webHidden/>
</w:rPr>
</w:r>
<w:r w:rsidR="009B5500">
<w:rPr>
<w:noProof/>
<w:webHidden/>
</w:rPr>
<w:fldChar w:fldCharType="separate"/>
</w:r>
<w:r w:rsidR="009B5500">
<w:rPr>
<w:noProof/>
<w:webHidden/>
</w:rPr>
               **<w:t>6</w:t>**
</w:r>
<w:r w:rsidR="009B5500">
<w:rPr>
<w:noProof/>
<w:webHidden/>
</w:rPr>
<w:fldChar w:fldCharType="end"/>
</w:r>
</w:hyperlink>
</w:p>

This is the first entry of the index, 1. Management Summary 6

回答1:

We can use:

library(xml2)
library(magrittr)

x <- read_xml("path/to/file.xml")

titles <- xml_find_all(x, 
               "/pkg:package//pkg:part/pkg:xmlData/w:document/w:body/w:p/w:hyperlink/w:r/w:t") %>%  
         xml_text() %>% 
         matrix(ncol = 3, byrow = T) %>% 
         as.data.frame()

colnames(titles)<- c('numChapter', 'title', 'numPage')

This retrives the text inside all the nodes corresponding to that xpath.

Based on your given example that xpath contains (what I suppose are) the numChapter, its title and its numPage.

As noted this will give an error if the xml is not well formed and/or some namespaces are missing.

Hope this helps

来源：https://stackoverflow.com/questions/42225937/import-a-complex-docx-file-as-xml-and-extract-the-chapters

标签

xml

indexing

extract