How to remove junk characters while reading a Word document stored in an 'OLE Object' field in an Access database through C#?

Question

I am accessing an MS Access database through C#, and I am able to read all the fields. The problem is that when I read .txt and .doc files stored in an OLE Object field of the table, a lot of extra junk characters are read before and after the actual text, like this: ÿÿÿÿ‡€ ÿÿÿÿÿÿÿÿˆ ÿÿÿÿÿÿÿÿ€ ˆˆˆˆˆˆˆˆ€ ÿÿÿÿÿÿÿÿþ i 8 @ñÿ 8 N o r m a l CJ _H aJ mH sH tH < A@òÿ¡ < D e f a u l t P a r a g r a p h F o n t … ÿÿÿÿ ( f p ³ ú ÿ A Ä M • À ' n î 0 q Œ Ï.

My C# code looks like this:

/*Read from the query and write in a temporary file*/
var oleBytes = (Byte[])Cmd.ExecuteScalar();
MemoryStream ms = new MemoryStream();
ms.Write(oleBytes, 0, oleBytes.Length - 0);
var file = Path.GetTempFileName();
using (var fileStream = File.OpenWrite(file))
 {
    var buffer = ms.GetBuffer();
    fileStream.Write(buffer, 0, (int)ms.Length);
 }

I then read this temporary file as a Word document:

Microsoft.Office.Interop.Word.ApplicationClass wordObject = new ApplicationClass();
object fpath = file; //this is the path
object nullobject = System.Reflection.Missing.Value;
Microsoft.Office.Interop.Word.Document docs = wordObject.Documents.Open
(ref fpath, ref nullobject, ref nullobject, ref nullobject,
ref nullobject, ref nullobject, ref nullobject, ref nullobject,
ref nullobject, ref nullobject, ref nullobject, ref nullobject,
ref nullobject, ref nullobject, ref nullobject, ref nullobject);

docs.ActiveWindow.Selection.WholeStory();

docs.ActiveWindow.Selection.Copy();

IDataObject iData = Clipboard.GetDataObject();

if (iData != null)
  data = iData.GetData(DataFormats.Text).ToString();

I don't know what is going wrong. Am I also reading the field's metadata from the table? If so, how can I avoid it? What would be an efficient way to read an OLE Object field that stores files other than images?

Answer 1:

I found the solution for Word documents (.doc files). OLE Object storage in MS Access puts some header information before the actual data, so simply extracting the field contents as a byte array and saving it to disk does not work. Every OLE Object has a standard signature. For Word documents, the OLEheaderLength is 85 bytes, so I strip those 85 bytes from the byte array (the code below skips them at the start of the array), like this:

Con.Open();
// Column licenseDoc contains Word and text documents stored as OLE Objects
string _query = "select licenseDoc from Products where ID=56";
OleDbCommand Cmd = new OleDbCommand(_query, Con);

const int offset = 85; // OLE header length for .doc files
var oleBytes = (Byte[])Cmd.ExecuteScalar();

var file = Path.GetTempFileName();
using (var fileStream = File.OpenWrite(file))
{
  // Skip the OLE header and write the remaining bytes to the temporary file
  fileStream.Write(oleBytes, offset, oleBytes.Length - offset);
}

The variable file will contain the path of the .tmp file, which holds the data read from the Word document stored as an OLE Object in MS Access. This file can be opened directly as a Word document, or its extension can be changed to .doc.

The OLEheaderLength values for the various formats are as follows:

1] JPEG/JPG=224
2] BMP=78
3] PDF=85
4] SNP=74
5] DOC=85/90
6] DOCX=87

I don't know the OLEheaderLength of .txt (plain text) files. Unfortunately, the above solution works only for .doc files; it fails for .docx files and other file formats.

To find the length of an OLE header, you can use the library that is explained and downloadable here: http://jvdveen.blogspot.in/2009/02/ole-and-accessing-files-embedded-in.html
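Since a fixed header length differs between formats, one alternative is to locate the start of the payload by its file signature instead. The following is only a minimal sketch of that idea, not the approach used in this answer: the class `OleHeaderProbe` and its methods are hypothetical names, and it assumes the embedded .doc begins with the standard Compound File signature (D0 CF 11 E0 A1 B1 1A E1). Like the code above, it leaves any trailing OLE bytes in place.

```csharp
using System;

// Hypothetical helper for illustration; not part of the original answer.
public static class OleHeaderProbe
{
    // Signature that opens Compound File Binary documents such as .doc.
    public static readonly byte[] CompoundFileMagic =
        { 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1 };

    // Returns the index of the first occurrence of pattern in data, or -1.
    public static int IndexOf(byte[] data, byte[] pattern)
    {
        for (int i = 0; i <= data.Length - pattern.Length; i++)
        {
            int j = 0;
            while (j < pattern.Length && data[i + j] == pattern[j]) j++;
            if (j == pattern.Length) return i;
        }
        return -1;
    }

    // Drops everything before the signature; returns null when it is absent.
    public static byte[] StripHeader(byte[] oleBytes, byte[] magic)
    {
        int start = IndexOf(oleBytes, magic);
        if (start < 0) return null;
        byte[] payload = new byte[oleBytes.Length - start];
        Array.Copy(oleBytes, start, payload, 0, payload.Length);
        return payload;
    }
}
```

With this, `OleHeaderProbe.IndexOf(oleBytes, OleHeaderProbe.CompoundFileMagic)` would report the header length actually present in a given row, which can be compared against the 85/90 values listed above.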

Answer 2:

I tried opening DOCX (.docx) and PDF files in Notepad++ and found strange but standard BOF (Beginning Of File) and EOF (End Of File) string patterns. From that I found a solution for extracting DOCX (.docx) files from an MS Access database. For .docx files, the OLEheaderLength is 87 bytes.

Con.Open();
string _query="select licenseDoc from Products where ID=56";
// Column licenseDoc contains Word documents stored as OLE Objects
OleDbCommand Cmd = new OleDbCommand(_query, Con);

var oleBytes = (Byte[])Cmd.ExecuteScalar();

const string START_BLOCK = "PK"; // DOCX files start with "PK"
const string END_BLOCK = "PK";   // DOCX files end with a last "PK" marker followed by 20 more bytes (the ZIP end record when it has no comment)
int startPos = -1;
int endpos = -1;

Encoding ascii = Encoding.ASCII;
string strEncoding = ascii.GetString(oleBytes);
if (strEncoding.IndexOf(START_BLOCK) != -1 && strEncoding.LastIndexOf(END_BLOCK) != -1)
{
     startPos = strEncoding.IndexOf(START_BLOCK);
     // Clamp to the array length so Array.Copy cannot read past the end
     endpos = Math.Min(strEncoding.LastIndexOf(END_BLOCK) + END_BLOCK.Length + 20, oleBytes.Length);
}
if (startPos == -1)
{
     throw new Exception("Could not find DOCX Header");
}

byte[] retByte = new byte[endpos - startPos];

Array.Copy(oleBytes , startPos, retByte, 0, endpos - startPos);

MemoryStream ms = new MemoryStream();
ms.Write(retByte, 0, retByte.Length);

var file = Path.GetTempFileName();
using (var fileStream = File.OpenWrite(file))
{
  var buffer = ms.GetBuffer();
  fileStream.Write(buffer, 0, (int)ms.Length);
}

The variable file will contain the path of the .tmp file, which holds the data read from the Word document stored as an OLE Object in MS Access. This file can be opened directly as a Word document, or its extension can be changed to .docx.
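The same start/end search can also be done directly on the bytes, without converting to an ASCII string. The sketch below is one possible variant under stated assumptions, not the code used above: `DocxExtractor` is a hypothetical name, it looks for the full four-byte ZIP markers ("PK" 03 04 at the start, "PK" 05 06 at the end), and it reads the comment-length field of the ZIP end record instead of assuming a fixed 20 trailing bytes.

```csharp
using System;

// Hypothetical helper for illustration; not part of the original answer.
public static class DocxExtractor
{
    // "PK\x03\x04": local file header that opens a ZIP (and therefore DOCX) archive.
    static readonly byte[] LocalHeader = { 0x50, 0x4B, 0x03, 0x04 };
    // "PK\x05\x06": end-of-central-directory record that closes the archive.
    static readonly byte[] EndRecord = { 0x50, 0x4B, 0x05, 0x06 };
    // Size of the end record without its optional comment.
    const int EndRecordFixedSize = 22;

    static bool MatchesAt(byte[] data, int pos, byte[] pattern)
    {
        if (pos < 0 || pos + pattern.Length > data.Length) return false;
        for (int j = 0; j < pattern.Length; j++)
            if (data[pos + j] != pattern[j]) return false;
        return true;
    }

    public static byte[] Extract(byte[] oleBytes)
    {
        // First local file header marks the start of the archive.
        int start = -1;
        for (int i = 0; i < oleBytes.Length && start < 0; i++)
            if (MatchesAt(oleBytes, i, LocalHeader)) start = i;
        if (start < 0) throw new InvalidOperationException("No ZIP local file header found");

        // Search backwards for the end record and honour its comment length.
        int end = -1;
        for (int i = oleBytes.Length - EndRecordFixedSize; i >= start && end < 0; i--)
        {
            if (!MatchesAt(oleBytes, i, EndRecord)) continue;
            int commentLength = oleBytes[i + 20] | (oleBytes[i + 21] << 8);
            int candidate = i + EndRecordFixedSize + commentLength;
            if (candidate <= oleBytes.Length) end = candidate;
        }
        if (end < 0) throw new InvalidOperationException("No ZIP end record found");

        byte[] docx = new byte[end - start];
        Array.Copy(oleBytes, start, docx, 0, docx.Length);
        return docx;
    }
}
```

The result could then be saved with `File.WriteAllBytes(path, DocxExtractor.Extract(oleBytes))`, where `path` is whatever target file name you choose.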

For PDF files, I found that the OLEheaderLength is 85 or 90. I haven't tried this for PDF, but you can try using:

const string START_BLOCK = "%PDF"; // PDF files start with "%PDF"
const string END_BLOCK = "%%EOF";  // PDF files end with the "%%EOF" marker; the extra 20 bytes added for DOCX are not part of a PDF
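As a rough illustration of how those two markers might be used, here is an untested-against-real-data sketch along the same lines as the DOCX code. `PdfExtractor` is a hypothetical name; the sketch assumes the payload runs from the first "%PDF" to the end of the last "%%EOF" (the usual closing marker of a PDF) and that nothing after that marker is needed.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration; not part of the original answer.
public static class PdfExtractor
{
    static readonly byte[] Header = Encoding.ASCII.GetBytes("%PDF");
    static readonly byte[] Trailer = Encoding.ASCII.GetBytes("%%EOF");

    static bool MatchesAt(byte[] data, int pos, byte[] pattern)
    {
        if (pos < 0 || pos + pattern.Length > data.Length) return false;
        for (int j = 0; j < pattern.Length; j++)
            if (data[pos + j] != pattern[j]) return false;
        return true;
    }

    public static byte[] Extract(byte[] oleBytes)
    {
        // First "%PDF" marks the start of the document.
        int start = -1;
        for (int i = 0; i <= oleBytes.Length - Header.Length; i++)
            if (MatchesAt(oleBytes, i, Header)) { start = i; break; }
        if (start < 0) throw new InvalidOperationException("Could not find PDF header");

        // Last "%%EOF" marks the end; keep the marker itself.
        int end = -1;
        for (int i = oleBytes.Length - Trailer.Length; i > start; i--)
            if (MatchesAt(oleBytes, i, Trailer)) { end = i + Trailer.Length; break; }
        if (end < 0) throw new InvalidOperationException("Could not find PDF end marker");

        byte[] pdf = new byte[end - start];
        Array.Copy(oleBytes, start, pdf, 0, pdf.Length);
        return pdf;
    }
}
```

Searching on bytes rather than on a decoded string keeps the indices exact regardless of encoding, which is the main reason for this variant.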

Source: https://stackoverflow.com/questions/9992087/how-to-remove-junk-characters-while-reading-a-word-document-stored-in-ole-objec

Tags

ms-access

ms-access-2007