Question
--update-- Maybe someone can suggest another way to split a .docx document into its chapters when importing the .docx into R.
First of all, I want to say thanks for this awesome forum. I have found solutions here for several of my issues, but this time I haven't found anything.
I have a complex .docx document, containing an index, that has been saved as .xml. So far I have tried the following:
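library(XML)
# HUGE lifts the parser's hard-coded size limits for very large documents
xmlfile <- xmlParse("C:/Users/Documents/stihl.xml", options = HUGE)
topxml <- xmlRoot(xmlfile)
# Collect the text value of every grandchild node of the root
topxml <- xmlSApply(topxml, function(x) xmlSApply(x, xmlValue))
xml_df <- data.frame(t(topxml), row.names = NULL)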
I also tried other ways of reading an XML file.
My .docx document has an index, and I now want to extract the content that belongs to each index entry. An example .docx:
1. Introduction
This is an introduction importing XML by R.
2. UserGuide
Userguides are often helpful.
2.1 Style
The style should be always the same.
2.2 Language
I hope my Language is readable, because I'm contacting you from Germany.
As a result, it would be nice to get the content of the separate chapters, for example stored in a vector.
result
[1]This is an introduction importing XML by R.
[2]Userguides are often helpful.
[3]The style should be always the same.
[4]I hope my Language is readable, because I'm contacting you from Germany.
Maybe there are other ways that keep the structure, but I assumed that an XML import, which preserves the tree structure, would be the easiest way.
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<?mso-application progid="Word.Document"?>
<pkg:package xmlns:pkg="http://schemas.microsoft.com/office/2006/xmlPackage">
<pkg:part
pkg:name="/_rels/.rels"
pkg:contentType="application/vnd.openxmlformats-package.relationships+xml"
pkg:padding="512">
<pkg:xmlData>
<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">
<Relationship
Id="rId3"
Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/extended-properties"
Target="docProps/app.xml"/>
<Relationship
Id="rId2"
Type="http://schemas.openxmlformats.org/package/2006/relationships/metadata/core-properties"
Target="docProps/core.xml"/>
<Relationship Id="rId1"
Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument"
Target="word/document.xml"/>
</Relationships>
</pkg:xmlData>
</pkg:part>
<pkg:part
#several relationships
</pkg:part>
<pkg:part
pkg:name="/word/document.xml"
pkg:contentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml">
<pkg:xmlData>
<w:document mc:Ignorable="w14 w15 wp14"
xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas"
xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006"
xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships"
xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math"
xmlns:v="urn:schemas-microsoft-com:vml"
xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing"
xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"
xmlns:w10="urn:schemas-microsoft-com:office:word"
xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml"
xmlns:w15="http://schemas.microsoft.com/office/word/2012/wordml"
xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup"
xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk"
xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml"
xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape">
<w:body>
<w:p> ...
</w:p>
<w:p w14:paraId="5BB64FEF" w14:textId="77777777" w:rsidR="005A3789" w:rsidRDefault="005A3789" w:rsidP="005A3789">
<w:pPr>
<w:pStyle w:val="Inhaltsverzeichnisberschrift"/>
</w:pPr>
<w:r>
<w:lastRenderedPageBreak/>
<w:t>Inhaltsverzeichnis</w:t>
</w:r>
</w:p>
'Inhaltsverzeichnis' (German for "table of contents") is the title of my index. The path is package -> 3rd part -> xmlData -> document -> body -> p.
The information is stored here, for example:
<w:p w14:paraId="15ECF978" w14:textId="77777777" w:rsidR="009B5500" w:rsidRDefault="005A3789">
<w:pPr>
<w:pStyle w:val="Verzeichnis1"/>
<w:rPr>
<w:rFonts w:eastAsiaTheme="minorEastAsia"/>
<w:b w:val="0"/>
<w:noProof/>
<w:color w:val="auto"/>
<w:lang w:eastAsia="de-DE"/>
</w:rPr>
</w:pPr>
<w:r>
<w:rPr>
<w:b w:val="0"/>
</w:rPr>
<w:fldChar w:fldCharType="begin"/>
</w:r>
<w:r>
<w:instrText xml:space="preserve"> TOC \o "1-4" \h \z \u
</w:instrText>
</w:r>
<w:r>
<w:rPr>
<w:b w:val="0"/>
</w:rPr>
<w:fldChar w:fldCharType="separate"/>
</w:r>
<w:hyperlink w:anchor="_Toc474825312" w:history="1">
<w:r w:rsidR="009B5500" w:rsidRPr="009D0220"><w:rPr>
<w:rStyle w:val="Hyperlink"/>
<w:noProof/>
</w:rPr>
<w:t>1</w:t>
</w:r>
<w:r w:rsidR="009B5500"><w:rPr><w:rFonts w:eastAsiaTheme="minorEastAsia"/>
<w:b w:val="0"/>
<w:noProof/>
<w:color w:val="auto"/>
<w:lang w:eastAsia="de-DE"/>
</w:rPr><w:tab/>
</w:r>
<w:r w:rsidR="009B5500" w:rsidRPr="009D0220">
<w:rPr>
<w:rStyle w:val="Hyperlink"/>
<w:noProof/>
</w:rPr>
<w:t>Management Summary</w:t>
</w:r>
<w:r w:rsidR="009B5500">
<w:rPr>
<w:noProof/>
<w:webHidden/>
</w:rPr>
<w:tab/>
</w:r>
<w:r w:rsidR="009B5500">
<w:rPr>
<w:noProof/>
<w:webHidden/>
</w:rPr><w:fldChar w:fldCharType="begin"/>
</w:r>
<w:r w:rsidR="009B5500">
<w:rPr>
<w:noProof/>
<w:webHidden/>
</w:rPr>
<w:instrText xml:space="preserve"> PAGEREF _Toc474825312 \h </w:instrText>
</w:r>
<w:r w:rsidR="009B5500">
<w:rPr>
<w:noProof/>
<w:webHidden/>
</w:rPr>
</w:r>
<w:r w:rsidR="009B5500">
<w:rPr>
<w:noProof/>
<w:webHidden/>
</w:rPr>
<w:fldChar w:fldCharType="separate"/>
</w:r>
<w:r w:rsidR="009B5500">
<w:rPr>
<w:noProof/>
<w:webHidden/>
</w:rPr>
<w:t>6</w:t>
</w:r>
<w:r w:rsidR="009B5500">
<w:rPr>
<w:noProof/>
<w:webHidden/>
</w:rPr>
<w:fldChar w:fldCharType="end"/>
</w:r>
</w:hyperlink>
</w:p>
This is the first entry of the index: "1 Management Summary 6" (chapter number, title, page number).
Answer 1:
We can use:
library(xml2)
library(magrittr)  # provides the %>% pipe

x <- read_xml("path/to/file.xml")

# Each index entry contributes three <w:t> nodes: chapter number, title, page number
titles <- xml_find_all(
  x,
  "/pkg:package//pkg:part/pkg:xmlData/w:document/w:body/w:p/w:hyperlink/w:r/w:t"
) %>%
  xml_text() %>%
  matrix(ncol = 3, byrow = TRUE) %>%
  as.data.frame()

colnames(titles) <- c("numChapter", "title", "numPage")
This retrieves the text inside all the nodes matching that XPath.
Based on your example, those nodes contain (what I suppose are) the numChapter, its title and its numPage.
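To see why the reshape into three columns works, here is a minimal sketch on toy data. The `entry()` helper and the second index entry ("UserGuide", page 7) are made up for illustration and are not taken from the asker's file; only the `w:hyperlink/w:r/w:t` layout mirrors the question.

```r
library(xml2)

# Toy input (an assumption, not the asker's file): each index entry has
# exactly three <w:t> runs inside a <w:hyperlink>.
entry <- function(num, title, page) {
  paste0("<w:p><w:hyperlink>",
         "<w:r><w:t>", num, "</w:t></w:r>",
         "<w:r><w:t>", title, "</w:t></w:r>",
         "<w:r><w:t>", page, "</w:t></w:r>",
         "</w:hyperlink></w:p>")
}
doc <- read_xml(paste0(
  "<w:document xmlns:w='http://schemas.openxmlformats.org/wordprocessingml/2006/main'>",
  "<w:body>", entry(1, "Management Summary", 6), entry(2, "UserGuide", 7), "</w:body>",
  "</w:document>"))

txt <- xml_text(xml_find_all(doc, "/w:document/w:body/w:p/w:hyperlink/w:r/w:t"))

# The reshape only lines up if every entry yields exactly three text nodes
stopifnot(length(txt) %% 3 == 0)

toc <- as.data.frame(matrix(txt, ncol = 3, byrow = TRUE), stringsAsFactors = FALSE)
colnames(toc) <- c("numChapter", "title", "numPage")
print(toc)
```

Runs that only hold a `<w:tab/>` or a `<w:fldChar/>` have no `<w:t>` child, so they do not disturb the count. A title that Word has split across several runs would, which is what the `stopifnot()` guard is meant to catch at least in part.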
As noted, this will fail if the XML is not well formed and/or some namespaces are missing.
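Both conditions can be checked before running the XPath. The following is only a sketch: `safe_read()` is a hypothetical helper, and the two inline documents are toy stand-ins for the real file.

```r
library(xml2)

# Hypothetical helper: returns the parsed document, or NULL if parsing fails
safe_read <- function(src) {
  tryCatch(read_xml(src), error = function(e) {
    message("XML could not be parsed: ", conditionMessage(e))
    NULL
  })
}

# Well-formed toy document declaring both prefixes used in the XPath
ok <- safe_read(paste0(
  "<pkg:package xmlns:pkg='http://schemas.microsoft.com/office/2006/xmlPackage' ",
  "xmlns:w='http://schemas.openxmlformats.org/wordprocessingml/2006/main'>",
  "<w:body/></pkg:package>"))

# Mismatched tags, so the parser rejects it
broken <- safe_read("<body><p></body>")

# Prefixes the XPath relies on that the document does not declare
missing_ns <- setdiff(c("pkg", "w"), names(xml_ns(ok)))
print(missing_ns)
```

If `missing_ns` is not empty, the XPath will not resolve those prefixes, and the namespace declarations in the file are the first place to look.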
Hope this helps
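As a follow-up: the XPath above only returns the index entries, not the chapter text. One possible way to get the text per chapter is to walk the body paragraphs and start a new group at every heading. This is a rough sketch on toy data; it assumes headings are paragraphs whose `w:pStyle` value starts with "Heading", which is an assumption (the style names in your document will likely differ), and `heading()` and `para()` are helpers made up to build the example.

```r
library(xml2)

# Toy document builders (hypothetical, for illustration only)
heading <- function(txt) {
  paste0("<w:p><w:pPr><w:pStyle w:val='Heading1'/></w:pPr>",
         "<w:r><w:t>", txt, "</w:t></w:r></w:p>")
}
para <- function(txt) paste0("<w:p><w:r><w:t>", txt, "</w:t></w:r></w:p>")

doc <- read_xml(paste0(
  "<w:document xmlns:w='http://schemas.openxmlformats.org/wordprocessingml/2006/main'>",
  "<w:body>",
  heading("Introduction"), para("This is an introduction importing XML by R."),
  heading("UserGuide"), para("Userguides are often helpful."),
  heading("Style"), para("The style should be always the same."),
  "</w:body></w:document>"))

paras <- xml_find_all(doc, "/w:document/w:body/w:p")
text  <- xml_text(paras)

# Style of each paragraph; NA where no <w:pStyle> is present
style <- xml_attr(xml_find_first(paras, "./w:pPr/w:pStyle"), "val")
is_heading <- !is.na(style) & startsWith(style, "Heading")

# Every heading opens a new chapter; text before the first heading is dropped
chapter <- cumsum(is_heading)
keep    <- !is_heading & chapter > 0

chunks <- split(text[keep], chapter[keep])
result <- vapply(chunks, paste, character(1), collapse = " ")
names(result) <- text[is_heading][as.integer(names(chunks))]
print(result)
```

The result is a character vector with one element per chapter, named by the heading text, which is close to the vector asked for in the question.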
Source: https://stackoverflow.com/questions/42225937/import-a-complex-docx-file-as-xml-and-extract-the-chapters