问题
I am accessing an Ms Access
database through C#
. I am able to read all the fields. The problem that I am getting is, while reading .txt
and .doc
files that are stored in OLE Object
field of the table, a lot of extra junk characters are also getting read before and after the actual text like- ÿÿÿÿ‡€ ÿÿÿÿÿÿÿÿˆ ÿÿÿÿÿÿÿÿ€ ˆˆˆˆˆˆˆˆ€ ÿÿÿÿÿÿÿÿþ
.
i 8 @ñÿ 8 N o r m a l CJ _H aJ mH sH tH < A@òÿ¡ <
D e f a u l t P a r a g r a p h F o n t … ÿÿÿÿ ( f p ³ ú ÿ A Ä M • À ' n î 0 q Œ Ï
My C# code is like- `
/*Read from the query and write in a temporary file*/
var oleBytes = (Byte[])Cmd.ExecuteScalar();
MemoryStream ms = new MemoryStream();
ms.Write(oleBytes, 0, oleBytes.Length - 0);
var file = Path.GetTempFileName();
using (var fileStream = File.OpenWrite(file))
{
var buffer = ms.GetBuffer();
fileStream.Write(buffer, 0, (int)ms.Length);
}
`
Then read this temporary file like a word document- `
Microsoft.Office.Interop.Word.ApplicationClass wordObject = new ApplicationClass();
object fpath = file; //this is the path
object nullobject = System.Reflection.Missing.Value;
Microsoft.Office.Interop.Word.Document docs = wordObject.Documents.Open
(ref fpath, ref nullobject, ref nullobject, ref nullobject,
ref nullobject, ref nullobject, ref nullobject, ref nullobject,
ref nullobject, ref nullobject, ref nullobject, ref nullobject,
ref nullobject, ref nullobject, ref nullobject, ref nullobject);
docs.ActiveWindow.Selection.WholeStory();
docs.ActiveWindow.Selection.Copy();
IDataObject iData = Clipboard.GetDataObject();
if (iData != null)
data = iData.GetData(DataFormats.Text).ToString();
`
Don't know what is going wrong? Am I reading the fields metadata also from the table? If so how to avoid it? What would be the efficient way to read OLE Object
field that stores files other than images?
回答1:
I found the solution for word documents (.doc
files). OLE object storage in Ms Access contains some header information before actual data, so simply extracting the field contents as a byte array and saving it to disk does not work. Any OLE Object file has some standard signature. For word documents, OLEheaderLength is 85 bytes
. So I strip 85 bytes from both ends of the byte array like-
Con.Open();
string _query="select licenseDoc from Products where ID=56";
//Column licenseDoc contains word and text douments as OLE Objects
OleDbCommand Cmd = new OleDbCommand(_query, Con);
const int offset =85;
var oleBytes = (Byte[])Cmd.ExecuteScalar();
MemoryStream ms = new MemoryStream();
ms.Write(oleBytes, offset, oleBytes.Length - offset);
var file = Path.GetTempFileName();
using (var fileStream = File.OpenWrite(file))
{
var buffer = ms.GetBuffer();
fileStream.Write(buffer, 0, (int)ms.Length);
}
The variable file
will contain the path of the .tmp
file, which contains the data read from from the word document stored as an OLE object in Ms Access
. This file can be directly opened in as a word document
or it's extension can be changed .doc
.
The OLEheaderLength
for other formats are as follows:
1] JPEG/JPG=224
2] BMP=78
3] PDF=85
4] SNP=74
5] DOC=85/90
6] DOCX=87
I don't know the OLEheaderLength
of .txt(Simple Text) files
. Unfortunately the above solution works only for .doc
files. But when it comes to .docx
files and any other file formats, it fails.
In order to find out the length of an ole header, you could simply use the library that is explained and downloadable from here - http://jvdveen.blogspot.in/2009/02/ole-and-accessing-files-embedded-in.html
回答2:
I tried opening a DOCX(.docx
) & PDF files in Notepad++
, and found strange but standard BOF(Beginning Of File)
& EOF(End Of File)
string patterns. Then I found a solution for extracting DOCX(.docx
) files from Ms Access DB. For .docx
files, OLEheaderLength
is 87 bytes.
Con.Open();
string _query="select licenseDoc from Products where ID=56";
//Column licenseDoc contains word douments as OLE Objects
OleDbCommand Cmd = new OleDbCommand(_query, Con);
var oleBytes = (Byte[])Cmd.ExecuteScalar();
const string START_BLOCK = "PK";//DOCX files starts with "PK"
const string END_BLOCK = "PK";//DOCX files ends with "PK" followed by some fixed 20 blank chars
int startPos = -1;
int endpos = -1;
Encoding ascii = Encoding.ASCII;
string strEncoding = ascii.GetString(oleBytes);
if (strEncoding.IndexOf(START_BLOCK) != -1 && strEncoding.LastIndexOf(END_BLOCK) != -1)
{
startPos = strEncoding.IndexOf(START_BLOCK);
endpos = strEncoding.LastIndexOf(END_BLOCK) + END_BLOCK.Length + 20;
}
if (startPos == -1)
{
throw new Exception("Could not find DOCX Header");
}
byte[] retByte = new byte[endpos - startPos];
Array.Copy(oleBytes , startPos, retByte, 0, endpos - startPos);
MemoryStream ms = new MemoryStream();
ms.Write(retByte, 0, retByte.Length);
var file = Path.GetTempFileName();
using (var fileStream = File.OpenWrite(file))
{
var buffer = ms.GetBuffer();
fileStream.Write(buffer, 0, (int)ms.Length);
}
The variable file
will contain the path of the .tmp
file, which contains the data read from from the word document stored as an OLE object in Ms Access. This file can be directly opened in as a word document or it's extension can be changed to .docx
.
For PDF files, found that OLEheaderLength
is 85 OR 90.
I haven't tried this for PDF, but you can try using-
const string START_BLOCK = "%PDF";//PDF files starts with "%PDF"
const string END_BLOCK = "%EOF";//PDF files ends with "%EOF" followed by some fixed 20 blank chars
In order to find out the length of an ole header, you could simply use the library that is explained and downloadable from here - http://jvdveen.blogspot.in/2009/02/ole-and-accessing-files-embedded-in.html
来源:https://stackoverflow.com/questions/9992087/how-to-remove-junk-characters-while-reading-a-word-document-stored-in-ole-objec