Question
I am using the code below to read a .docx file, and it successfully extracts the text. The problem is that it extracts only the text, without the line breaks. For example, if my document looks like this:
I am line 1
I am line 2 I am some other text
then it returns:
I am line 1I am line 2I am some other text.
I want the text exactly as it appears in the document. How can I do that? Below is the code I am using now.
open System
open System.IO
open System.IO.Packaging
open System.Xml
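let getDocxContent (path: string) =
    // Open the .docx package (a ZIP container) and locate the main document part.
    use package = Package.Open(path, FileMode.Open)
    use stream = package.GetPart(new Uri("/word/document.xml", UriKind.Relative)).GetStream()
    stream.Seek(0L, SeekOrigin.Begin) |> ignore
    let xmlDoc = new XmlDocument()
    xmlDoc.Load(stream)
    // InnerText concatenates every text node, so paragraph boundaries are lost.
    xmlDoc.DocumentElement.InnerText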
let docData = getDocxContent @"C:\a1.docx"
printfn "%s" docData
Answer 1:
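You need to set the PreserveWhitespace property on your XmlDocument to true before loading it. The property defaults to false, in which case whitespace-only nodes are dropped during Load.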
So change the code from:
let xmlDoc = new XmlDocument()
xmlDoc.Load(stream)
To:
let xmlDoc = new XmlDocument()
xmlDoc.PreserveWhitespace <- true
xmlDoc.Load(stream)
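If the line breaks still do not come through, note that word/document.xml typically marks paragraphs with w:p elements rather than newline characters. As a minimal sketch of an alternative, the text can be collected paragraph by paragraph and joined with newlines. The helper name paragraphsOfDocumentXml is hypothetical and not part of the original code. It takes the XML as a string so it can be tried without a .docx file, and it ignores tabs and manual breaks (w:tab, w:br).

```fsharp
open System
open System.Xml

// WordprocessingML main namespace, bound to the w: prefix in word/document.xml.
let wordNs = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

// Hypothetical helper: returns one string per <w:p> paragraph element.
let paragraphsOfDocumentXml (xml: string) : string list =
    let doc = XmlDocument()
    doc.PreserveWhitespace <- true
    doc.LoadXml xml
    let ns = XmlNamespaceManager(doc.NameTable)
    ns.AddNamespace("w", wordNs)
    doc.SelectNodes("//w:p", ns)
    |> Seq.cast<XmlNode>
    |> Seq.map (fun p ->
        // Concatenate the text of every <w:t> run inside this paragraph.
        p.SelectNodes(".//w:t", ns)
        |> Seq.cast<XmlNode>
        |> Seq.map (fun t -> t.InnerText)
        |> String.concat "")
    |> List.ofSeq

// Joins the paragraphs with the platform newline.
let textOfDocumentXml (xml: string) : string =
    paragraphsOfDocumentXml xml |> String.concat Environment.NewLine
```

Inside getDocxContent, this could be used by reading the stream into a string (for example with a StreamReader) and passing it to textOfDocumentXml instead of returning InnerText.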
Source: https://stackoverflow.com/questions/9485990/extracting-the-text-inside-a-docx-file