How to programmatically iterate through subscripts, superscripts and equations found in a Word document

Submitted by 允我心安 on 2019-12-06 06:20:24

Question


I have a few Word documents, each containing a few hundred pages of scientific data, which includes:

  • Chemical formulae (H2SO4 with all proper subscripts & superscripts)
  • Scientific numbers (exponents formatted using superscripts)
  • Lots of mathematical equations, written using the equation editor in Word.

The problem is that storing this data as Word documents is not efficient for us, so we want to store all this information in a database (MySQL). We want to convert this formatting to LaTeX.

Is there any way to iterate through all the subscripts, superscripts and equations using VBA?

What about iterating through mathematical equations?


Answer 1:


Based on your comment on Michael's answer:

"No! I just want to replace content in the subscript with _{ subscriptcontent } and similarly superscript content with ^{ superscriptcontent }. That would be the TeX equivalent. Now, I'll just copy everything to a text file, which will remove the formatting but leave these characters. Problem solved. But for that I need to access the subscript & superscript objects of the document."

A find-and-replace on the font formatting does that:

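Sub sampler()
    Selection.HomeKey wdStory
    With Selection.Find
        'Superscript: wrap each formatted run in ^{...}
        .ClearFormatting
        .Replacement.ClearFormatting
        .Text = ""                      'search by formatting only
        .Font.Superscript = True
        .Replacement.Text = "^^{^&}"    '^^ is a literal caret, ^& is the found text
        .Execute Replace:=wdReplaceAll

        'Subscript: reset the search criteria, then wrap each run in _{...}
        .ClearFormatting
        .Font.Subscript = True
        .Replacement.Text = "_{^&}"
        .Execute Replace:=wdReplaceAll
    End With
End Sub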
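
For readers who would rather work outside Word, the same wrapping can be sketched in Python directly against the file. This is only an illustration under assumptions, not part of the macro above: it assumes the document is saved as .docx (Open XML), that subscripts and superscripts are applied as direct run formatting (w:vertAlign) rather than through a style, and the names runs_to_tex and docx_to_tex are made up for the example.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def runs_to_tex(document_xml):
    """Return the paragraph text with sub/superscript runs wrapped as _{...} / ^{...}."""
    root = ET.fromstring(document_xml)
    paragraphs = []
    for para in root.iter(f"{{{W}}}p"):
        parts = []
        for run in para.iter(f"{{{W}}}r"):
            text = "".join(t.text or "" for t in run.findall("w:t", NS))
            if not text:
                continue  # run without visible text (e.g. a break)
            align = run.find("w:rPr/w:vertAlign", NS)
            kind = align.get(f"{{{W}}}val") if align is not None else None
            if kind == "superscript":
                parts.append("^{" + text + "}")
            elif kind == "subscript":
                parts.append("_{" + text + "}")
            else:
                parts.append(text)
        paragraphs.append("".join(parts))
    return "\n".join(paragraphs)


def docx_to_tex(path_or_file):
    """Read word/document.xml from a .docx package and convert its runs."""
    with zipfile.ZipFile(path_or_file) as package:
        return runs_to_tex(package.read("word/document.xml"))
```

Equations created with the equation editor live in a different XML namespace and are not touched by this sketch.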

EDIT

Or, if you also want to convert OMaths to TeX / LaTeX, then do something like this:

  • Iterate over the OMaths > convert each one to MathML > [save the MathML to disk] + [put a marker in the document, in place of the OMath, that references the MathML file] > save the Word files as text.
  • Prepare a converter such as MathParser and convert the MathML files to LaTeX.
  • Parse the text file > search for the markers and replace them with the corresponding LaTeX code.

For a completely different idea, visit David Carlisle's blog; it might interest you.
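
The marker bookkeeping in the steps above can be sketched in Python. This is a rough illustration under assumptions: it works on the document.xml of a .docx file, the function names collect_equations and restore_equations and the marker format <key_n> are invented for the example, and the two XSLT conversions (OMML to MathML, MathML to LaTeX) are left out because the Python standard library has no XSLT processor.

```python
import xml.etree.ElementTree as ET

M = "http://schemas.openxmlformats.org/officeDocument/2006/math"


def collect_equations(document_xml, key="doc"):
    """Map a marker such as <doc_1> to the OMML source of each m:oMath element."""
    root = ET.fromstring(document_xml)
    equations = {}
    for n, omath in enumerate(root.iter(f"{{{M}}}oMath"), start=1):
        # ElementTree serialises trailing text with the element; leave it out.
        tail, omath.tail = omath.tail, None
        equations[f"<{key}_{n}>"] = ET.tostring(omath, encoding="unicode")
        omath.tail = tail
    return equations


def restore_equations(text, latex_by_marker):
    """Replace each marker in the exported text by its converted LaTeX code."""
    for marker, latex in latex_by_marker.items():
        # The closing ">" keeps <doc_1> from matching inside <doc_10>.
        text = text.replace(marker, latex, 1)
    return text
```

For example, restore_equations("Area <doc_1>", {"<doc_1>": "$x^2$"}) gives "Area $x^2$".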

UPDATE

The module

Option Explicit

'This module requires the following references:
'Microsoft Scripting Runtime
'Microsoft XML, v6.0

Private fso As New Scripting.FileSystemObject
Private omml2mml$, mml2Tex$

Public Function ProcessFile(fpath$) As Boolean
    'Paths as set on my system (they may vary on yours):
    omml2mml = "c:\program files\microsoft office\office14\omml2mml.xsl"
    'download: http://prdownloads.sourceforge.net/xsltml/xsltml_2.0.zip
    'unzip at «c:\xsltml_2.0»
    mml2Tex = "c:\xsltml_2.0\mmltex.xsl"

    Documents.Open fpath

    'Emphasis + Superscript + Subscript
    Selection.HomeKey wdStory
    With Selection.Find
        .ClearFormatting
        .Replacement.ClearFormatting

        'to make sure no paragraph mark carries any emphasis
        .Text = "^p"
        .Replacement.Text = "^&"
        .Replacement.Font.Italic = False
        .Replacement.Font.Bold = False
        .Replacement.Font.Superscript = False
        .Replacement.Font.Subscript = False
        .Replacement.Font.SmallCaps = False
        .Execute Replace:=wdReplaceAll

        'from here on search by formatting only; the criteria are reset
        'before each pass so that they do not accumulate
        .Replacement.ClearFormatting
        .Text = ""

        .ClearFormatting
        .Font.Italic = True
        .Replacement.Text = "\textit{^&}"
        .Execute Replace:=wdReplaceAll

        .ClearFormatting
        .Font.Bold = True
        .Replacement.Text = "\textbf{^&}"
        .Execute Replace:=wdReplaceAll

        .ClearFormatting
        .Font.SmallCaps = True
        .Replacement.Text = "\textsc{^&}"
        .Execute Replace:=wdReplaceAll

        .ClearFormatting
        .Font.Superscript = True
        .Replacement.Text = "^^{^&}"
        .Execute Replace:=wdReplaceAll

        .ClearFormatting
        .Font.Subscript = True
        .Replacement.Text = "_{^&}"
        .Execute Replace:=wdReplaceAll
    End With

    Dim dict As New Scripting.Dictionary
    Dim om As OMath, t, counter&, key$
    key = Replace(LCase(Dir(fpath)), " ", "_omath_")
    counter = 0

    For Each om In ActiveDocument.OMaths
        DoEvents
        counter = counter + 1
        Dim tKey$, texCode$
        tKey = "<" & key & "_" & counter & ">"
        t = om.Range.WordOpenXML

        texCode = TransformString(TransformString(CStr(t), omml2mml), mml2Tex)
        om.Range.Select
        Selection.Delete
        Selection.Text = tKey

        dict.Add tKey, texCode

    Next om

    Dim latexDoc$, oPath$
    latexDoc = "\documentclass[10pt]{article}" & vbCrLf & _
                "\usepackage[utf8]{inputenc} % set input encoding" & vbCrLf & _
                "\usepackage{amsmath,amssymb}" & vbCrLf & _
                "\begin{document}" & vbCrLf & _
                "###" & vbCrLf & _
                "\end{document}"

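    oPath = StrReverse(Mid(StrReverse(fpath), InStr(StrReverse(fpath), "."))) & "tex"
    'Save the document as plain text; without this step the .tex file
    'read below does not exist. Encoding 65001 is UTF-8 (matching the
    'inputenc option above), 1200 would be UTF-16.
    'ActiveDocument.SaveAs FileName:=oPath, FileFormat:=wdFormatText, Encoding:=1200
    ActiveDocument.SaveAs FileName:=oPath, FileFormat:=wdFormatText, Encoding:=65001
    ActiveDocument.Close SaveChanges:=wdDoNotSaveChanges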

    Dim c$, i
    c = fso.OpenTextFile(oPath).ReadAll()

    counter = 0

    For Each i In dict
        counter = counter + 1
        Dim findText$, replaceWith$
        findText = CStr(i)
        replaceWith = dict.item(i)
        c = Replace(c, findText, replaceWith, 1, 1, vbTextCompare)
    Next i

    latexDoc = Replace(latexDoc, "###", c)

    Dim ost As TextStream
    Set ost = fso.CreateTextFile(oPath)
    ost.Write latexDoc
    ost.Close

    ProcessFile = True


End Function

Private Function CreateDOM()
    Dim dom As New DOMDocument60
    With dom
        .async = False
        .validateOnParse = False
        .resolveExternals = False
    End With
    Set CreateDOM = dom
End Function

Private Function TransformString(xmlString$, xslPath$) As String
    Dim xml, xsl, out
    Set xml = CreateDOM
    xml.LoadXML xmlString
    Set xsl = CreateDOM
    xsl.Load xslPath
    out = xml.transformNode(xsl)
    TransformString = out
End Function

Calling it (from the Immediate window):

?ProcessFile("c:\test.doc")

The result is created as test.tex in c:\.


The module may need fixing in some places. If so, let me know.




Answer 2:


The Document object in Word has an OMaths collection, which represents all OMath objects in the document. The OMath object has a Functions property, which returns the collection of functions within that OMath object. So the equations shouldn't be that big of an issue.

I imagine you want to capture more than just the subscripts and superscripts, though: you would want the entire equation containing them. That could be more challenging, as you'd have to define a starting and ending point. If you were to use the .Find method to find a subscript and then select everything between the first space character before it and the first space character after it, that might work, but only if your equation contained no spaces.
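
The space-delimited heuristic can be sketched on a plain string to show where it breaks down. This is a Python illustration only: the function name token_around is made up, and the start and end offsets of the subscript or superscript are assumed to be known already (for example from a Find hit).

```python
def token_around(text, start, end):
    """Return the space-delimited token that contains text[start:end]."""
    left = text.rfind(" ", 0, start) + 1   # 0 when no space precedes the hit
    right = text.find(" ", end)
    if right == -1:                        # no space after the hit
        right = len(text)
    return text[left:right]
```

With "so E=mc2 holds" and the superscript at offsets 7 to 8, this returns "E=mc2". With "a + b2 = c" and the superscript at offsets 5 to 6, it returns only "b2", which is the limitation mentioned above for equations that contain spaces.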




Answer 3:


This VBA sub should go through every text character in your document and remove the superscript and subscript formatting while inserting the LaTeX notation.

Public Sub LatexConversion()

    Dim myRange As Word.Range, myChr
    'Visit every story (main text, headers, footnotes, ...)
    For Each myRange In ActiveDocument.StoryRanges
        Do
            For Each myChr In myRange.Characters

                If myChr.Font.Superscript = True Then
                    myChr.Font.Superscript = False
                    myChr.InsertBefore "^"
                End If

                If myChr.Font.Subscript = True Then
                    myChr.Font.Subscript = False
                    myChr.InsertBefore "_"
                End If

            Next
            'Follow linked stories of the same type
            Set myRange = myRange.NextStoryRange
        Loop Until myRange Is Nothing
    Next
End Sub

If some equations were created with Word's built-in equation editor or via building blocks (Word 2010/2007) and exist inside content controls, the above will not work. These equations will require either separate VBA conversion code or manual conversion to text-only equations before executing the above.
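
As written, the sub appears to insert one marker per character, so a three-character superscript such as -23 would come out as ^-^2^3. A possible refinement is to group consecutive characters that share the same formatting and wrap each group once. The Python sketch below only illustrates that grouping step; the function name wrap_runs and the (character, kind) input format are assumptions made for the example.

```python
from itertools import groupby


def wrap_runs(chars):
    """Wrap runs of equally formatted characters once.

    chars is a sequence of (character, kind) pairs, where kind is
    "sup", "sub" or None for normal text.
    """
    out = []
    for kind, group in groupby(chars, key=lambda pair: pair[1]):
        text = "".join(ch for ch, _ in group)
        if kind == "sup":
            out.append("^{" + text + "}")
        elif kind == "sub":
            out.append("_{" + text + "}")
        else:
            out.append(text)
    return "".join(out)
```

For the characters of 10 followed by a superscript -23, this yields 10^{-23} instead of one caret per character.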




Answer 4:


C# implementation of the Office Math (OMath) to LaTeX conversion using the Open XML SDK. Download the MMLTEX XSL files from http://sourceforge.net/projects/xsltml/

    public void OMathTolaTeX()
    {
        string OMath = "";
        string MathML = "";
        string LaTex = "";
        XslCompiledTransform xslTransform = new XslCompiledTransform();
        // The OMML2MML.XSL file is located under
        // %ProgramFiles%\Microsoft Office\Office12\
        // Copy it to a local folder first.
        xslTransform.Load(@"D:\OMML2MML.XSL");
        using (WordprocessingDocument wordDoc =
                  WordprocessingDocument.Open("test.docx", true))
        {
            OpenXmlElement doc = wordDoc.MainDocumentPart.Document.Body;

            foreach (var par in doc.Descendants<Paragraph>())
            {
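               var math = par.Descendants<DocumentFormat.OpenXml.Math.Paragraph>().FirstOrDefault();
               if (math == null) continue; // this paragraph contains no equation
               File.WriteAllText("D:\\openmath.xml", math.OuterXml);
               OMath = math.OuterXml; // note: a later equation overwrites an earlier one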

           }
        }
        //Load OMath string into stream
        using (XmlReader reader = XmlReader.Create(new StringReader(OMath)))
        {
            using (MemoryStream ms = new MemoryStream())
            {
                XmlWriterSettings settings = xslTransform.OutputSettings.Clone();

                // Configure xml writer to omit xml declaration.
                settings.ConformanceLevel = ConformanceLevel.Fragment;
                settings.OmitXmlDeclaration = true;

                XmlWriter xw = XmlWriter.Create(ms, settings);

                // Transform Office Math (OMML) to MathML
                xslTransform.Transform(reader, xw);
                xw.Flush(); // push buffered output into the stream before reading it
                ms.Seek(0, SeekOrigin.Begin);

                StreamReader sr = new StreamReader(ms, Encoding.UTF8);

                MathML = sr.ReadToEnd();

                Console.Out.WriteLine(MathML);
                File.WriteAllText("d:\\MATHML.xml", MathML);
                sr.Close();
                reader.Close();
                ms.Close();

            }
        }
        var xmlResolver = new XmlUrlResolver();
        xslTransform = new XslCompiledTransform();
        XsltSettings xsltt = new XsltSettings(true, true);
        // The mmltex.xsl file converts MathML to TeX
        xslTransform.Load("mmltex.xsl", xsltt, xmlResolver);

        using (XmlReader reader = XmlReader.Create(new StringReader(MathML)))
        {
            using (MemoryStream ms = new MemoryStream())
            {
                XmlWriterSettings settings = xslTransform.OutputSettings.Clone();

                // Configure xml writer to omit xml declaration.
                settings.ConformanceLevel = ConformanceLevel.Fragment;
                settings.OmitXmlDeclaration = true;

                XmlWriter xw = XmlWriter.Create(ms, settings);

                // Transform MathML to LaTeX
                xslTransform.Transform(reader, xw);
                xw.Flush(); // push buffered output into the stream before reading it
                ms.Seek(0, SeekOrigin.Begin);

                StreamReader sr = new StreamReader(ms, Encoding.UTF8);

                LaTex = sr.ReadToEnd();
                sr.Close();
                reader.Close();
                ms.Close();
                Console.Out.WriteLine(LaTex);
                File.WriteAllText("d:\\Latex.txt", LaTex);

            }
        }
    }

Hope this helps C# developers.



Source: https://stackoverflow.com/questions/11565839/how-to-programmatically-iterate-through-subscripts-superscripts-and-equations-fo
