Extracting the text inside a docx file

Submitted by 自作多情 on 2019-12-11 15:35:45

Question


I am using the code below to read a .docx file, and it successfully extracts the text from the file. The problem is that it extracts only the text and drops the line breaks and spacing. For example, if my document contains:

I am line 1

I am line 2      I am some other text

then it returns:

I am line 1I am line 2I am some other text.

I want the text exactly as it appears in the document. How can I do that? Below is the code I am using now.

 open System
 open System.IO
 open System.IO.Packaging // lives in WindowsBase.dll on .NET Framework
 open System.Xml

 // Opens the .docx package and returns the concatenated text of word/document.xml.
 let getDocxContent (path: string) =
     use package = Package.Open(path, FileMode.Open)
     let part = package.GetPart(Uri("/word/document.xml", UriKind.Relative))
     use stream = part.GetStream()
     stream.Seek(0L, SeekOrigin.Begin) |> ignore
     let xmlDoc = XmlDocument()
     xmlDoc.Load(stream)
     // InnerText joins every text node with no separator between paragraphs.
     xmlDoc.DocumentElement.InnerText

 let docData = getDocxContent @"C:\a1.docx"
 printfn "%s" docData

Answer 1:


You need to set the PreserveWhitespace property on your XmlDocument before loading it.

So change the code from:

let xmlDoc = new XmlDocument()
xmlDoc.Load(stream)

To:

let xmlDoc = new XmlDocument()
xmlDoc.PreserveWhitespace <- true
xmlDoc.Load(stream)
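
If the paragraphs still come back joined after this change (Word often writes document.xml with no whitespace between elements, so there may be nothing to preserve), one alternative is to walk the paragraph elements and insert the separators yourself. The following is only a minimal sketch, not part of the original answer: paragraphTexts is a hypothetical helper name, and it assumes body text sits in w:t elements inside w:r runs, with w:tab and w:br marking tabs and line breaks. Nested paragraphs (for example inside text boxes) would need extra handling.

```fsharp
open System.Xml

// Main WordprocessingML namespace used by word/document.xml.
let wordNs = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

// Hypothetical helper: returns the text of each <w:p> paragraph as one string.
// <w:tab/> inside a run becomes a tab character, <w:br/> becomes a newline.
let paragraphTexts (xmlDoc: XmlDocument) : string list =
    let ns = XmlNamespaceManager(xmlDoc.NameTable)
    ns.AddNamespace("w", wordNs)
    xmlDoc.SelectNodes("//w:p", ns)
    |> Seq.cast<XmlNode>
    |> Seq.map (fun p ->
        // Restrict to direct children of runs so tab-stop definitions
        // under <w:pPr>/<w:tabs> are not mistaken for tab characters.
        p.SelectNodes(".//w:r/w:t | .//w:r/w:tab | .//w:r/w:br", ns)
        |> Seq.cast<XmlNode>
        |> Seq.map (fun n ->
            match n.LocalName with
            | "t" -> n.InnerText
            | "tab" -> "\t"
            | _ -> "\n")
        |> String.concat "")
    |> List.ofSeq
```

In getDocxContent, the last line could then become String.concat System.Environment.NewLine (paragraphTexts xmlDoc) instead of xmlDoc.DocumentElement.InnerText, so that each paragraph ends up on its own line.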


Source: https://stackoverflow.com/questions/9485990/extracting-the-text-inside-a-docx-file
