Simplify/clean up the XML of a DOCX Word document

断了今生、忘了曾经 submitted on 2019-12-02 22:33:43
K B

I have found a solution: the Open XML PowerTools Markup Simplifier.

I followed the steps described at http://ericwhite.com/blog/2011/03/09/getting-started-with-open-xml-powertools-markup-simplifier/, but they didn't work exactly as written (perhaps because PowerTools is now at version 2.2). So I compiled PowerTools 2.2 in "Release" mode and added a reference to OpenXmlPowerTools.dll in my TestMarkupSimplifier.csproj. In Program.cs I only changed the path to my DOCX file. I ran the program once, and my document seems to be fairly clean now.

Code quoted from Eric's blog in the link above:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using OpenXmlPowerTools;
using DocumentFormat.OpenXml.Packaging;

class Program
{
    static void Main(string[] args)
    {
        using (WordprocessingDocument doc = WordprocessingDocument.Open("Test.docx", true))
        {
            SimplifyMarkupSettings settings = new SimplifyMarkupSettings
            {
                RemoveComments = true,
                RemoveContentControls = true,
                RemoveEndAndFootNotes = true,
                RemoveFieldCodes = false,
                RemoveLastRenderedPageBreak = true,
                RemovePermissions = true,
                RemoveProof = true,
                RemoveRsidInfo = true,
                RemoveSmartTags = true,
                RemoveSoftHyphens = true,
                ReplaceTabsWithSpaces = true,
            };
            MarkupSimplifier.SimplifyMarkup(doc, settings);
        }
    }
}
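
To check whether the simplifier actually changed anything, one option is to count the runs and rsid attributes in the main document part before and after running it. The sketch below is not from Eric's blog. It assumes the main part is stored at word/document.xml (the usual location, although formally the package relationships decide this), and DocxInspector, Count and CountInFile are hypothetical names chosen for illustration. On .NET Framework, ZipFile needs a reference to System.IO.Compression.FileSystem.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper, not part of Open XML PowerTools.
static class DocxInspector
{
    // Main WordprocessingML namespace (the "w:" prefix in document.xml).
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Counts w:r elements and w:rsid* attributes in already-loaded document XML.
    public static (int Runs, int RsidAttributes) Count(XDocument xml)
    {
        int runs = xml.Descendants(W + "r").Count();
        int rsids = xml.Descendants()
            .Attributes()
            .Count(a => a.Name.Namespace == W &&
                        a.Name.LocalName.StartsWith("rsid", StringComparison.Ordinal));
        return (runs, rsids);
    }

    // Opens the .docx as a ZIP package and counts inside word/document.xml.
    // Assumption: the main part lives at that conventional path.
    public static (int Runs, int RsidAttributes) CountInFile(string path)
    {
        using (ZipArchive zip = ZipFile.OpenRead(path))
        {
            ZipArchiveEntry entry = zip.GetEntry("word/document.xml");
            if (entry == null)
                throw new FileNotFoundException("word/document.xml not found in package", path);
            using (Stream stream = entry.Open())
            {
                return Count(XDocument.Load(stream));
            }
        }
    }
}
```

Calling DocxInspector.CountInFile("Test.docx") once before and once after the simplifier run gives two pairs of numbers to compare. If the cleanup worked, the rsid count should be much lower (ideally zero).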

I do not know of a way to clean up the XML, but I've always used #placeholder for my placeholder text, and that seems to stay in one run more often than any other placeholder text I've tried in the past. It seems that the longer the placeholder text, the more likely it is to be split into multiple runs.
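
To illustrate why split runs matter for placeholders, here is a minimal sketch using LINQ to XML on a paragraph element. It is an illustration only, not something from this answer: PlaceholderFinder and its two methods are hypothetical names. Searching each w:t element on its own misses a placeholder that Word has split across runs, while searching the concatenated paragraph text still finds it.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper for illustration only.
static class PlaceholderFinder
{
    // Main WordprocessingML namespace (the "w:" prefix in document.xml).
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // True only if a single w:t element contains the whole placeholder.
    public static bool InSingleRun(XElement paragraph, string placeholder)
    {
        return paragraph.Descendants(W + "t")
            .Any(t => t.Value.Contains(placeholder));
    }

    // True if the placeholder appears in the paragraph's text once the
    // text of all w:t elements is joined in document order.
    public static bool InParagraphText(XElement paragraph, string placeholder)
    {
        string text = string.Concat(
            paragraph.Descendants(W + "t").Select(t => t.Value));
        return text.Contains(placeholder);
    }
}
```

For a paragraph whose runs contain "Dear #place" and "holder,", InSingleRun returns false for "#placeholder" and InParagraphText returns true. Finding the text is the easy half; replacing it still means deciding which run keeps the replacement, which is why a cleaned-up document with fewer runs is easier to work with.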

Amadeus Sánchez

You need to get rid of the Rsid information. According to this page Rsid information

enables merging of two documents that have forked.

You need to install OpenXmlPowerTools in order to run the sample code below. The easiest way to do that is to run the following command in the Package Manager Console:

Install-Package OpenXmlPowerTools

Then you will be all set to run the following code, assuming that you already have a "Test.docx" file added to your project. If you are using Visual Studio, make sure that a copy of the file is in the Debug or Release output folder, depending on your build configuration.

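// Sample code to remove Rsid information from a "Test.docx" document.
// Required using directives (place them at the top of the file):
using DocumentFormat.OpenXml.Packaging;
using OpenXmlPowerTools;

// Open the document for editing (second argument: isEditable = true).
using (WordprocessingDocument doc = WordprocessingDocument.Open("Test.docx", true))
{
    SimplifyMarkupSettings settings = new SimplifyMarkupSettings
    {
        RemoveRsidInfo = true
    };
    MarkupSimplifier.SimplifyMarkup(doc, settings);
}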

This will remove Rsid information that may get in the way in the process of manipulating Word files.
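
To show what "Rsid information" looks like in the markup, here is a small LINQ to XML sketch that strips w:rsid* attributes (such as w:rsidR or w:rsidRPr) from an element tree. This is only an illustration of the idea under my own assumptions, not how PowerTools implements RemoveRsidInfo, and RsidStripper and StripRsidAttributes are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper; illustrates the idea, not the PowerTools implementation.
static class RsidStripper
{
    // Main WordprocessingML namespace (the "w:" prefix in document.xml).
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Removes every w:rsid* attribute under (and on) root, in place.
    // Returns the number of attributes removed.
    public static int StripRsidAttributes(XElement root)
    {
        List<XAttribute> rsidAttributes = root.DescendantsAndSelf()
            .Attributes()
            .Where(a => a.Name.Namespace == W &&
                        a.Name.LocalName.StartsWith("rsid", StringComparison.Ordinal))
            .ToList(); // materialize first, then mutate the tree

        foreach (XAttribute attribute in rsidAttributes)
        {
            attribute.Remove();
        }
        return rsidAttributes.Count;
    }
}
```

Other attributes, such as w:val on formatting elements, are left alone. Once the rsid attributes are gone, neighbouring runs with the same formatting no longer differ in their markup, which makes them much easier to merge or search.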

For those looking for a manual, non-programmatic solution:

http://www.translationtribulations.com/2010/06/cleaning-up-superfluous-tags-in-docx.html

I've tested it, and the free trial of memoQ 2014 can indeed be used as a somewhat heavyweight workaround for cleaning up Word's spell-check tags.

Still looking for an easier ready-out-of-the-box tool.
