Question
I have a few Word documents, each containing a few hundred pages of scientific data, including:
- Chemical formulae (e.g. H2SO4, with all proper subscripts and superscripts)
- Scientific numbers (exponents formatted as superscripts)
- Many mathematical equations, written with the equation editor in Word
Storing this data as Word files is not efficient for us, so we want to store it in a database (MySQL) and convert the formatting to LaTeX.
Is there any way to iterate through all the subscripts, superscripts and equations using VBA?
What about iterating through mathematical equations?
Answer 1:
Based on your comment on Michael's answer:
"No! I just want to replace the content of each subscript with _{subscriptcontent} and, similarly, the content of each superscript with ^{superscriptcontent}. That would be the TeX equivalent. Then I'll just copy everything to a text file, which removes the formatting but keeps these characters. Problem solved. But for that I need to access the subscript and superscript objects of the document."
A Find and Replace macro can do this:
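Sub sampler()
    Selection.HomeKey wdStory
    With Selection.Find
        .ClearFormatting
        .Replacement.ClearFormatting
        .Text = ""  ' empty search text: match on formatting only
        ' Wrap superscript runs as ^{...}
        ' ("^^" is a literal caret, "^&" is the matched text)
        .Font.Superscript = True
        .Replacement.Text = "^^{^&}"
        .Execute Replace:=wdReplaceAll
        ' Wrap subscript runs as _{...}
        .ClearFormatting
        .Font.Subscript = True
        .Replacement.Text = "_{^&}"
        .Execute Replace:=wdReplaceAll
    End With
End Sub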
EDIT
If you also want to convert the OMath equations to TeX/LaTeX, do something like this:
- Iterate over the OMaths, convert each one to MathML, save the MathML to disk, put a marker referencing the MathML file in the document in place of the OMath, then save the Word file as text.
- Prepare a converter such as MathParser and convert the MathML files to LaTeX.
- Parse the text file and replace each marker with the corresponding LaTeX code.
For a completely different idea, visit David Carlisle's blog; that might interest you.
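The last step, putting the LaTeX back in place of the markers, can be sketched outside Word as well. The Python snippet below is only an illustration under assumptions: the function name fill_placeholders, the marker format <eq_N> and the sample equations are made up here and are not part of the answer.

```python
import re


def fill_placeholders(text: str, latex_by_token: dict) -> str:
    """Replace every marker found in text with its LaTeX code."""
    if not latex_by_token:
        return text
    # Markers contain characters such as '.', so escape them for the regex.
    pattern = re.compile("|".join(re.escape(tok) for tok in latex_by_token))
    # A function replacement keeps backslashes in the LaTeX untouched.
    return pattern.sub(lambda m: latex_by_token[m.group(0)], text)


# Hypothetical exported text and marker-to-LaTeX mapping.
exported = "Energy: <eq_1> and momentum: <eq_2>."
latex = {"<eq_1>": r"$E = mc^{2}$", "<eq_2>": r"$p = \frac{h}{\lambda}$"}
print(fill_placeholders(exported, latex))
```

Because each marker ends with ">", a marker such as <eq_1> cannot accidentally match inside <eq_10>.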
UPDATE
The module:
Option Explicit
' This module requires the following references:
'   Microsoft Scripting Runtime
'   Microsoft XML, v6.0
Private fso As New Scripting.FileSystemObject
Private omml2mml$, mml2Tex$
Public Function ProcessFile(fpath$) As Boolean
    ' Office's OMML-to-MathML stylesheet; path as on my system (may vary on yours):
    omml2mml = "c:\program files\microsoft office\office14\omml2mml.xsl"
    ' MathML-to-LaTeX stylesheet
    ' download: http://prdownloads.sourceforge.net/xsltml/xsltml_2.0.zip
    ' unzip at «c:\xsltml_2.0»
    mml2Tex = "c:\xsltml_2.0\mmltex.xsl"
    Documents.Open fpath
    ' Emphasis, superscript and subscript
    Selection.HomeKey wdStory
    With Selection.Find
        .ClearFormatting
        .Replacement.ClearFormatting
        ' Make sure no paragraph mark carries any emphasis
        .Text = "^p"
        .Replacement.Text = "^&"
        .Replacement.Font.Italic = False
        .Replacement.Font.Bold = False
        .Replacement.Font.Superscript = False
        .Replacement.Font.Subscript = False
        .Replacement.Font.SmallCaps = False
        .Execute Replace:=wdReplaceAll
        ' From here on, match on formatting only. The search criteria are reset
        ' before each pass so they do not accumulate. The replacement font set
        ' above stays in effect, so wrapped text loses its formatting.
        .Text = ""
        .ClearFormatting
        .Font.Italic = True
        .Replacement.Text = "\textit{^&}"
        .Execute Replace:=wdReplaceAll
        .ClearFormatting
        .Font.Bold = True
        .Replacement.Text = "\textbf{^&}"
        .Execute Replace:=wdReplaceAll
        .ClearFormatting
        .Font.SmallCaps = True
        .Replacement.Text = "\textsc{^&}"
        .Execute Replace:=wdReplaceAll
        .ClearFormatting
        .Font.Superscript = True
        .Replacement.Text = "^^{^&}"
        .Execute Replace:=wdReplaceAll
        .ClearFormatting
        .Font.Subscript = True
        .Replacement.Text = "_{^&}"
        .Execute Replace:=wdReplaceAll
    End With
    ' Replace each equation with a marker and remember its LaTeX code
    Dim dict As New Scripting.Dictionary
    Dim om As OMath, t, counter&, key$
    key = Replace(LCase(Dir(fpath)), " ", "_omath_")
    counter = 0
    For Each om In ActiveDocument.OMaths
        DoEvents
        counter = counter + 1
        Dim tKey$, texCode$
        tKey = "<" & key & "_" & counter & ">"
        ' OMML -> MathML -> LaTeX via the two stylesheets
        t = om.Range.WordOpenXML
        texCode = TransformString(TransformString(CStr(t), omml2mml), mml2Tex)
        ' Put the marker in place of the equation
        om.Range.Select
        Selection.Delete
        Selection.Text = tKey
        dict.Add tKey, texCode
    Next om
    Dim latexDoc$, oPath$
    ' LaTeX skeleton; "###" marks where the document body goes
    latexDoc = "\documentclass[10pt]{article}" & vbCrLf & _
        "\usepackage[utf8]{inputenc} % set input encoding" & vbCrLf & _
        "\usepackage{amsmath,amssymb}" & vbCrLf & _
        "\begin{document}" & vbCrLf & _
        "###" & vbCrLf & _
        "\end{document}"
    ' Same path as the input file, with the extension replaced by "tex"
    oPath = StrReverse(Mid(StrReverse(fpath), InStr(StrReverse(fpath), "."))) & "tex"
    ' Save as plain text so that the file read below exists.
    ' Encoding 65001 is UTF-8 (matching inputenc above); use 1200 for UTF-16 instead:
    'ActiveDocument.SaveAs FileName:=oPath, FileFormat:=wdFormatText, Encoding:=1200
    ActiveDocument.SaveAs FileName:=oPath, FileFormat:=wdFormatText, Encoding:=65001
    ActiveDocument.Close
    ' Read the text back and replace each marker with its LaTeX code
    Dim c$, i
    c = fso.OpenTextFile(oPath).ReadAll()
    counter = 0
    For Each i In dict
        counter = counter + 1
        Dim findText$, replaceWith$
        findText = CStr(i)
        replaceWith = dict.item(i)
        c = Replace(c, findText, replaceWith, 1, 1, vbTextCompare)
    Next i
    latexDoc = Replace(latexDoc, "###", c)
    Dim ost As TextStream
    Set ost = fso.CreateTextFile(oPath)
    ost.Write latexDoc
    ost.Close
    ProcessFile = True
End Function
Private Function CreateDOM()
    Dim dom As New DOMDocument60
    With dom
        .async = False
        .validateOnParse = False
        .resolveExternals = False
    End With
    Set CreateDOM = dom
End Function
Private Function TransformString(xmlString$, xslPath$) As String
    Dim xml, xsl, out
    Set xml = CreateDOM
    xml.LoadXML xmlString
    Set xsl = CreateDOM
    xsl.Load xslPath
    out = xml.transformNode(xsl)
    TransformString = out
End Function
Call it from the Immediate window:
?ProcessFile("c:\test.doc")
The result is created as test.tex in c:\.
The module may need fixes in some places. If so, let me know.
Answer 2:
The Document object in Word has an OMaths collection, which represents all OMath objects in the document. Each OMath object has a Functions property that returns the collection of functions within that equation. So the equations shouldn't be that big of an issue.
I imagine you want to capture more than just the subscripts and superscripts, though: you would want the entire equation containing them. That could be more challenging, as you'd have to define a starting and an ending point. If you were to use the .Find method to find a subscript and then select everything between the first space character before it and the first space character after it, that might work, but only if your equation contained no spaces.
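To make the limitation of the space-delimited approach concrete, here is a small Python sketch on plain text. It is an illustration only: the function name token_around and the sample line are assumptions, and a flat string stands in for the Word document.

```python
def token_around(text: str, index: int) -> str:
    """Return the whitespace-delimited token that contains text[index].

    Assumes text[index] is itself not a whitespace character.
    """
    start = index
    while start > 0 and not text[start - 1].isspace():
        start -= 1
    end = index
    while end < len(text) and not text[end].isspace():
        end += 1
    return text[start:end]


# Hypothetical line in which each "2" was a superscript in the original.
line = "so that E=mc2 holds, whereas a + b2 = c does not fit"
print(token_around(line, line.index("2")))   # the compact equation is captured whole
print(token_around(line, line.rindex("2")))  # only a fragment of the spaced equation
```

The first call returns the whole of E=mc2, while the second returns only b2, because the equation a + b2 = c contains spaces.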
Answer 3:
This VBA sub goes through every text character in your document, removes superscript and subscript formatting, and inserts the LaTeX marker (^ or _) before each such character.
Public Sub LatexConversion()
    Dim myRange As Word.Range, myChr
    For Each myRange In ActiveDocument.StoryRanges
        Do
            For Each myChr In myRange.Characters
                If myChr.Font.Superscript = True Then
                    myChr.Font.Superscript = False
                    myChr.InsertBefore "^"
                End If
                If myChr.Font.Subscript = True Then
                    myChr.Font.Subscript = False
                    myChr.InsertBefore "_"
                End If
            Next
            Set myRange = myRange.NextStoryRange
        Loop Until myRange Is Nothing
    Next
End Sub
If some equations were created with Word's built-in equation editor or via building blocks (Word 2010/2007) and exist inside content controls, the above will not work. These equations will require either separate VBA conversion code or manual conversion to text-only equations before running the above.
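Since the sub marks each character separately, a multi-character exponent comes out as one marker per character rather than a single braced group. A possible variation is to group consecutive characters that share the same formatting. The Python sketch below shows that grouping on a simplified model; the (character, style) pairs and the name runs_to_latex are assumptions made for illustration, not something Word provides.

```python
from itertools import groupby


def runs_to_latex(chars):
    """Convert (character, style) pairs to text with LaTeX markers.

    style is "normal", "super" or "sub"; consecutive characters with the
    same style are wrapped together as ^{...} or _{...}.
    """
    markers = {"super": "^", "sub": "_"}
    parts = []
    for style, group in groupby(chars, key=lambda pair: pair[1]):
        text = "".join(ch for ch, _ in group)
        if style in markers:
            parts.append(markers[style] + "{" + text + "}")
        else:
            parts.append(text)
    return "".join(parts)


# Hypothetical character data for the formula H2SO4.
sample = [("H", "normal"), ("2", "sub"), ("S", "normal"),
          ("O", "normal"), ("4", "sub")]
print(runs_to_latex(sample))  # H_{2}SO_{4}
```

The same idea would apply in VBA by extending a range while the next character has the same Superscript or Subscript setting.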
Answer 4:
A C# implementation that converts Office Math (OMath) to LaTeX using the Open XML SDK. Download the MMLTEX XSL files from http://sourceforge.net/projects/xsltml/
using System;
using System.IO;
using System.Linq;
using System.Text;
using System.Xml;
using System.Xml.Xsl;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

public void OMathTolaTeX()
{
    string OMath = "";
    string MathML = "";
    string LaTex = "";
    XslCompiledTransform xslTransform = new XslCompiledTransform();
    // The OMML2MML.XSL file is located under
    // %ProgramFiles%\Microsoft Office\Office12\
    // Copy it to a local folder
    xslTransform.Load(@"D:\OMML2MML.XSL");
    // Nothing is written back to the document, so open it read-only
    using (WordprocessingDocument wordDoc =
        WordprocessingDocument.Open("test.docx", false))
    {
        OpenXmlElement doc = wordDoc.MainDocumentPart.Document.Body;
        foreach (var par in doc.Descendants<Paragraph>())
        {
            var math = par.Descendants<DocumentFormat.OpenXml.Math.Paragraph>().FirstOrDefault();
            if (math == null) continue; // this paragraph contains no equation
            File.WriteAllText("D:\\openmath.xml", math.OuterXml);
            // Note: each equation overwrites the previous one,
            // so only the last equation found is converted below
            OMath = math.OuterXml;
        }
    }
    if (OMath.Length == 0) return; // no equation found
    // Load the OMath string into a reader
    using (XmlReader reader = XmlReader.Create(new StringReader(OMath)))
    {
        using (MemoryStream ms = new MemoryStream())
        {
            XmlWriterSettings settings = xslTransform.OutputSettings.Clone();
            // Configure the XML writer to omit the XML declaration.
            settings.ConformanceLevel = ConformanceLevel.Fragment;
            settings.OmitXmlDeclaration = true;
            XmlWriter xw = XmlWriter.Create(ms, settings);
            // Transform Office MathML (OMML) to MathML
            xslTransform.Transform(reader, xw);
            xw.Flush();
            ms.Seek(0, SeekOrigin.Begin);
            StreamReader sr = new StreamReader(ms, Encoding.UTF8);
            MathML = sr.ReadToEnd();
            Console.Out.WriteLine(MathML);
            File.WriteAllText("d:\\MATHML.xml", MathML);
            sr.Close();
            reader.Close();
            ms.Close();
        }
    }
    var xmlResolver = new XmlUrlResolver();
    xslTransform = new XslCompiledTransform();
    // Enable document() and scripts; the resolver lets the stylesheet load the files it includes
    XsltSettings xsltt = new XsltSettings(true, true);
    // The mmltex.xsl file converts MathML to TeX
    xslTransform.Load("mmltex.xsl", xsltt, xmlResolver);
    using (XmlReader reader = XmlReader.Create(new StringReader(MathML)))
    {
        using (MemoryStream ms = new MemoryStream())
        {
            XmlWriterSettings settings = xslTransform.OutputSettings.Clone();
            // Configure the XML writer to omit the XML declaration.
            settings.ConformanceLevel = ConformanceLevel.Fragment;
            settings.OmitXmlDeclaration = true;
            XmlWriter xw = XmlWriter.Create(ms, settings);
            // Transform MathML to LaTeX
            xslTransform.Transform(reader, xw);
            xw.Flush();
            ms.Seek(0, SeekOrigin.Begin);
            StreamReader sr = new StreamReader(ms, Encoding.UTF8);
            LaTex = sr.ReadToEnd();
            sr.Close();
            reader.Close();
            ms.Close();
            Console.Out.WriteLine(LaTex);
            File.WriteAllText("d:\\Latex.txt", LaTex);
        }
    }
}
Hope this helps C# developers.
Source: https://stackoverflow.com/questions/11565839/how-to-programmatically-iterate-through-subscripts-superscripts-and-equations-fo