How to access OpenXML content by page number?

悲&欢浪女 2020-12-03 22:57

Using OpenXML, can I read the document content by page number?

wordDocument.MainDocumentPart.Document.Body gives the content of the full document.



        
4 Answers
  • 2020-12-03 23:14

    You cannot reference OOXML content via page numbering at the OOXML data level alone.

    • Hard page breaks are not the problem; hard page breaks can be counted.
    • Soft page breaks are the problem. These are calculated according to line break and pagination algorithms which are implementation dependent; it is not intrinsic to the OOXML data. There is nothing to count.

    What about w:lastRenderedPageBreak, which is a record of the position of a soft page break at the time the document was last rendered? No, w:lastRenderedPageBreak does not help in general either because:

    • By definition, the w:lastRenderedPageBreak position is stale when content has been changed since the document was last opened by a program that paginates its content.
    • In MS Word's implementation, w:lastRenderedPageBreak is known to be unreliable in various circumstances, including
      1. when a table spans two pages
      2. when the next page starts with an empty paragraph
      3. for multi-column layouts with text boxes starting a new column
      4. for large images or long sequences of blank lines

    If you're willing to accept a dependence on Word Automation, with all of its inherent licensing and server operation limitations, then you have a chance of determining page boundaries, page numberings, page counts, etc.

    Otherwise, the only real answer is to move beyond page-based referencing frameworks that are dependent upon proprietary, implementation-specific pagination algorithms.
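
    To make the distinction above concrete, here is a minimal sketch, assuming the Open XML SDK (DocumentFormat.OpenXml) is referenced. PageBreakStats and Count are illustrative names, not part of the SDK. It counts hard page breaks, which can be trusted, and w:lastRenderedPageBreak markers, which are only a possibly stale hint. It deliberately ignores w:pageBreakBefore and section breaks, which can also start a new page.

```csharp
using System.Linq;
using DocumentFormat.OpenXml.Wordprocessing;

public static class PageBreakStats
{
    // Returns the number of explicit <w:br w:type="page"/> elements (Hard) and
    // the number of <w:lastRenderedPageBreak/> markers (Rendered) under the body.
    public static (int Hard, int Rendered) Count(Body body)
    {
        int hard = body.Descendants<Break>()
            .Count(b => b.Type != null && b.Type.Value == BreakValues.Page);
        int rendered = body.Descendants<LastRenderedPageBreak>().Count();
        return (hard, rendered);
    }
}
```

    Called as PageBreakStats.Count(wordDocument.MainDocumentPart.Document.Body), a result with Rendered equal to 0 on a long document usually means the file has never been paginated by Word, so no soft page positions are available at all.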

  • 2020-12-03 23:17

    This is how I ended up doing it.

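        using System.Collections.Generic;
        using System.Linq;
        using System.Text;
        using DocumentFormat.OpenXml.Packaging;
        using DocumentFormat.OpenXml.Wordprocessing;

        public Dictionary<int, string> OpenWordprocessingDocumentReadonly()
        {
            string filepath = @"C:\...\test.docx";
            // Maps a 1-based page number to the text found on that page.
            var pagewiseContent = new Dictionary<int, string>();
            int pageCount = 0;
            // Open a WordprocessingDocument read-only, based on a filepath.
            using (WordprocessingDocument wordDocument =
                WordprocessingDocument.Open(filepath, false))
            {
                // Assign a reference to the existing document body.
                Body body = wordDocument.MainDocumentPart.Document.Body;
                // The cached page count is optional: the part is often missing in
                // generated documents, and the value may be stale.
                string pagesText = wordDocument.ExtendedFilePropertiesPart?.Properties?.Pages?.Text;
                int.TryParse(pagesText, out pageCount);
                int i = 1;
                var pageContentBuilder = new StringBuilder();
                foreach (var element in body.ChildElements)
                {
                    // Text that shares an element with a page break is attributed
                    // to the page that ends at that break.
                    pageContentBuilder.Append(element.InnerText);
                    // Look for <w:br w:type="page"/> in the object model instead of
                    // matching the serialized XML string.
                    bool hasPageBreak = element.Descendants<Break>()
                        .Any(b => b.Type != null && b.Type.Value == BreakValues.Page);
                    if (hasPageBreak)
                    {
                        pagewiseContent.Add(i, pageContentBuilder.ToString());
                        i++;
                        pageContentBuilder = new StringBuilder();
                    }
                }
                // Flush whatever follows the last page break.
                if (pageContentBuilder.Length > 0)
                {
                    pagewiseContent.Add(i, pageContentBuilder.ToString());
                }
            }
            return pagewiseContent;
        }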
    

    Downside: this won't work in all scenarios. It only works when the document contains explicit page breaks. If text simply flows from page 1 onto page 2, nothing in the markup tells you that you are on page 2.

  • 2020-12-03 23:26

    // wp is an open WordprocessingDocument.
    List<Paragraph> allParagraphs = wp.MainDocumentPart.Document.Body.OfType<Paragraph>().ToList();

    // Paragraphs that contain a w:lastRenderedPageBreak, i.e. where Word last recorded a soft page break.
    List<Paragraph> pageParagraphs = allParagraphs.Where(p => p.Descendants<LastRenderedPageBreak>().Any()).ToList();
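
    Building on the idea of locating paragraphs that carry a w:lastRenderedPageBreak, the following is a rough sketch of grouping paragraph text into pages. It assumes the Open XML SDK is referenced; RenderedPages and Split are hypothetical names. It is only an approximation: the marker may sit in the middle of a paragraph, in which case the whole paragraph is assigned to the new page, paragraphs inside tables are not visited, and the markers themselves may be stale or missing.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text;
using DocumentFormat.OpenXml.Wordprocessing;

public static class RenderedPages
{
    // Splits the body's top-level paragraphs into text chunks, starting a new
    // chunk at each paragraph that contains a w:lastRenderedPageBreak marker.
    public static List<string> Split(Body body)
    {
        var pages = new List<string>();
        var current = new StringBuilder();
        foreach (var paragraph in body.Elements<Paragraph>())
        {
            bool startsPage = paragraph.Descendants<LastRenderedPageBreak>().Any();
            if (startsPage && current.Length > 0)
            {
                pages.Add(current.ToString());
                current.Clear();
            }
            current.AppendLine(paragraph.InnerText);
        }
        // Flush the text collected after the last marker.
        if (current.Length > 0)
        {
            pages.Add(current.ToString());
        }
        return pages;
    }
}
```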

  • 2020-12-03 23:38

    Rename the .docx file to .zip and open the docProps\app.xml file:

    <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
    <Properties xmlns="http://schemas.openxmlformats.org/officeDocument/2006/extended-properties" xmlns:vt="http://schemas.openxmlformats.org/officeDocument/2006/docPropsVTypes">
      <Template>Normal</Template>
      <TotalTime>0</TotalTime>
      <Pages>1</Pages>
      <Words>141</Words>
      <Characters>809</Characters>
      <Application>Microsoft Office Word</Application>
      <DocSecurity>0</DocSecurity>
      <Lines>6</Lines>
      <Paragraphs>1</Paragraphs>
      <ScaleCrop>false</ScaleCrop>
      <HeadingPairs>
        <vt:vector size="2" baseType="variant">
          <vt:variant>
            <vt:lpstr>Название</vt:lpstr>
          </vt:variant>
          <vt:variant>
            <vt:i4>1</vt:i4>
          </vt:variant>
        </vt:vector>
      </HeadingPairs>
      <TitlesOfParts>
        <vt:vector size="1" baseType="lpstr">
          <vt:lpstr/>
        </vt:vector>
      </TitlesOfParts>
      <Company/>
      <LinksUpToDate>false</LinksUpToDate>
      <CharactersWithSpaces>949</CharactersWithSpaces>
      <SharedDoc>false</SharedDoc>
      <HyperlinksChanged>false</HyperlinksChanged>
      <AppVersion>14.0000</AppVersion>
    </Properties>
    

    The OpenXML library reads wordDocument.ExtendedFilePropertiesPart.Properties.Pages.Text from the <Pages>1</Pages> element. These properties are written by the Word application (winword) when it saves the file. If the document has been changed since then, wordDocument.ExtendedFilePropertiesPart.Properties.Pages.Text is no longer accurate. If the document was created programmatically, wordDocument.ExtendedFilePropertiesPart is often null.
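
    The same value can be read without the SDK by treating the .docx as a zip archive, as described above. This is a minimal sketch using only System.IO.Compression and System.Xml.Linq (C# 8 or later for the using declarations). AppProperties and ReadCachedPageCount are hypothetical names, and the code assumes the part lives at the conventional docProps/app.xml path rather than resolving it through the package relationships. It returns null instead of throwing when the part or the element is missing.

```csharp
using System.IO;
using System.IO.Compression;
using System.Xml.Linq;

public static class AppProperties
{
    private static readonly XNamespace Ep =
        "http://schemas.openxmlformats.org/officeDocument/2006/extended-properties";

    // Returns the cached <Pages> value from docProps/app.xml, or null when the
    // part or the element is missing or does not hold an integer.
    public static int? ReadCachedPageCount(Stream docx)
    {
        using var zip = new ZipArchive(docx, ZipArchiveMode.Read, leaveOpen: true);
        ZipArchiveEntry entry = zip.GetEntry("docProps/app.xml");
        if (entry == null)
        {
            return null;
        }
        using Stream xml = entry.Open();
        XElement pages = XDocument.Load(xml).Root?.Element(Ep + "Pages");
        return int.TryParse(pages?.Value, out int count) ? count : (int?)null;
    }
}
```

    Even when a number comes back, it is only the page count recorded at the last save by a paginating application, so it should be treated as a hint rather than a fact about the current content.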
