Finding implicit page breaks in a Word document using XML parsing

ⅰ亾dé卋堺 submitted on 2019-12-11 10:22:17

Question


I need to extract the first-page content of a Word document. Looking at the OpenXML of a WordprocessingML document, I can see elements such as <w:lastRenderedPageBreak /> and <w:br w:type="page" />. It seems that <w:br w:type="page" /> occurs when the user enters a hard page break, but I don't understand in which cases <w:lastRenderedPageBreak /> occurs. It appears for some implicit page breaks but not all. For example, I typed some text and then pressed Enter several times until the cursor moved to the next page, then pressed Enter several more times on the new page. This is what I get:

DOCUMENT.XML

    <w:p w:rsidR="00266B4A" w:rsidRDefault="00266B4A" w:rsidP="00266B4A">
      <w:r>
        <w:t xml:space="preserve">All my fun TEXT.</w:t>
      </w:r>
    </w:p>
    <w:p w:rsidR="0061403F" w:rsidRDefault="0061403F" w:rsidP="00266B4A" />
    <w:p w:rsidR="00266B4A" w:rsidRDefault="00266B4A" w:rsidP="00266B4A" />
    <w:p w:rsidR="00266B4A" w:rsidRDefault="00266B4A" w:rsidP="00266B4A" />
    <w:p w:rsidR="00266B4A" w:rsidRDefault="00266B4A" w:rsidP="00266B4A" />
    <w:p w:rsidR="00266B4A" w:rsidRDefault="00266B4A" w:rsidP="00266B4A" />
    <w:p w:rsidR="00266B4A" w:rsidRDefault="00266B4A" w:rsidP="00266B4A" />   <!-- page break falls here -->
    <w:p w:rsidR="00266B4A" w:rsidRDefault="00266B4A" w:rsidP="00266B4A" />
    <w:p w:rsidR="00266B4A" w:rsidRDefault="00266B4A" w:rsidP="00266B4A" />
    <w:p w:rsidR="00266B4A" w:rsidRDefault="00266B4A" w:rsidP="00266B4A" />
    <w:p w:rsidR="00266B4A" w:rsidRDefault="00266B4A" w:rsidP="00266B4A">
      <w:r>
        <w:t xml:space="preserve">All my fun TEXT.</w:t>
      </w:r>
    </w:p>

As you can see, even though the cursor moves to the next page as I press Enter, there is no trace of this in the document.xml file in the extracted Word document folder. Can someone help me find the implicit page break in the Word document so that I can extract the content of its first page? If there is no way to detect the content of a particular page in OpenXML, how do PDF conversion tools work, where each Word document page is converted to a page in the PDF?

Please do not suggest APIs like POI, which have no provision for extracting the content of a particular page.

Edit: The reason I need to find the implicit page break is that my task involves extracting the cover image from a Word document. The heuristic I'm following is: "if the first page of the document contains only an image, then it is a cover image; otherwise there is no cover image". So I need to get the content of the first page alone and check whether it contains only an image. How can I do it?


Answer 1:


The short answer is that it's not possible to do what you want by examining the XML. The page rendering engine of Word (or a PDF converter) is what determines where the page breaks. The XML simply describes the content to be "flowed" by the rendering engine.
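To illustrate how little the XML carries, here is a minimal Python sketch that lists the only page-break evidence stored in document.xml: hard breaks (<w:br w:type="page" />) and <w:lastRenderedPageBreak />, a hint that Word records inside a run from the last time it laid the document out. The function names `find_break_hints` and `read_document_xml` are made up for this example, and the sketch assumes the usual word/document.xml part name and looks only at top-level body paragraphs (tables are ignored). Since the rendered hint sits inside a <w:r> element, one plausible reason it is missing in the question's sample is that the break fell on an empty paragraph with no run to hold it.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, in ElementTree's "{uri}" tag form.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def find_break_hints(document_xml):
    """Return (paragraph_index, kind) for each page-break marker found.

    kind is "hard" for <w:br w:type="page"/> and "rendered" for
    <w:lastRenderedPageBreak/>. Only top-level body paragraphs are
    scanned; this is a sketch, not a layout engine.
    """
    root = ET.fromstring(document_xml)
    body = root.find(W + "body")
    hints = []
    if body is None:
        return hints
    for index, para in enumerate(body.findall(W + "p")):
        for node in para.iter():
            if node.tag == W + "br" and node.get(W + "type") == "page":
                hints.append((index, "hard"))
            elif node.tag == W + "lastRenderedPageBreak":
                hints.append((index, "rendered"))
    return hints


def read_document_xml(docx_path):
    """Read the main document part from a .docx (a zip archive)."""
    with zipfile.ZipFile(docx_path) as archive:
        return archive.read("word/document.xml")
```

Calling `find_break_hints(read_document_xml("some.docx"))` on a file like the one in the question returns an empty list, which is exactly the problem: when neither marker is present, nothing in the XML says where page one ends. Even when a rendered hint is present, it only reflects the layout at the time the file was last saved by Word, so it can be stale or absent in files produced by other tools.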



Source: https://stackoverflow.com/questions/24720972/finding-implicit-page-break-in-word-document-using-xml-parsing
