Python: Access embedded OLE from Office/Excel document without clipboard

不思量自难忘° 2020-12-18 23:11

I want to add files to and extract files from an Office/Excel document using Python. Adding is easy so far, but I haven't found a clean way to extract them.

To mak

3 answers
  •  天涯浪人
    2020-12-19 00:11

    Well, I find Parfait's solution a bit hackish (in the bad sense) because:

    • it assumes that Excel will save the embedding as a temporary file,
    • it assumes that the path of this temporary file is always the user's default temp path,
    • it assumes that you will have privileges to open files there,
    • it assumes that you use a naming convention to identify your objects (e.g. 'test_txt' is always found in the name, you can't insert an object 'account_data'),
    • it assumes that this convention is not disturbed by the operating system (e.g. it will not change it to '~test_tx(1)' to save character length),
    • it assumes that this convention is known and accepted by all other programs on the computer (no one else will use names that contain 'test_txt').

    So, I wrote an alternative solution. The essence of it is the following:

    1. unzip the .xlsx file (or any other Office file in the new XML-based format, which is not password protected) to a temporary path.

    2. iterate through all .bin files inside the '/xxx/embeddings' directory ('xxx' = 'xl', 'word' or 'ppt'), and build a dictionary whose keys are the .bin files' temporary paths and whose values are the dictionaries returned by step 3.

    3. extract information from the .bin file according to the (not very well documented) Ole Packager format, and return the information as a dictionary. (Retrieves the raw binary data as 'contents', not only from .txt but any file type, e.g. .png)
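As a side note on steps 1 and 2: the embedded .bin entries can also be located without unpacking the whole archive, because `zipfile` can read individual members straight from the zip. The sketch below is only an illustration of that idea; the function name `list_embedded_bins` and the in-memory sample archive are made up for the example. The returned bytes would still have to be written to a file before `pythoncom.StgOpenStorage` can open them, since that call takes a file name.

```python
import io
import zipfile


def list_embedded_bins(path_or_file):
    """Return {archive member name: raw bytes} for every embedded .bin object."""
    # Member names inside a zip always use forward slashes.
    prefixes = ('xl/embeddings/', 'word/embeddings/', 'ppt/embeddings/')
    found = {}
    with zipfile.ZipFile(path_or_file) as archive:
        for name in archive.namelist():
            if name.startswith(prefixes) and name.lower().endswith('.bin'):
                found[name] = archive.read(name)
    return found


# Demonstration with a fake archive built in memory (not a real workbook).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr('xl/workbook.xml', '<workbook/>')
    archive.writestr('xl/embeddings/oleObject1.bin', b'\xd0\xcf\x11\xe0 fake OLE data')
buffer.seek(0)

embedded = list_embedded_bins(buffer)
print(sorted(embedded))
```

This avoids the lookup table of file extensions, at the cost of checking all three directory prefixes for every document type.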

    I'm still learning Python, so this is not perfect (no error checking, no performance optimization) but you can get the idea from it. I tested it on a few examples. Here is my code:

    import tempfile
    import os
    import shutil
    import zipfile
    import glob
    import pythoncom
    import win32com.storagecon
    
    
    def read_zipped_xml_bin_embeddings( path_zipped_xml ):
        temp_dir = tempfile.mkdtemp()
    
        with zipfile.ZipFile( path_zipped_xml ) as zip_file: # closed automatically, even if extraction fails
            zip_file.extractall( temp_dir )
    
        subdir = {
                '.xlsx': 'xl',
                '.xlsm': 'xl',
                '.xltx': 'xl',
                '.xltm': 'xl',
                '.docx': 'word',
                '.dotx': 'word',
                '.docm': 'word',
                '.dotm': 'word',
                '.pptx': 'ppt',
                '.pptm': 'ppt',
                '.potx': 'ppt',
                '.potm': 'ppt',
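            }[ os.path.splitext( path_zipped_xml )[ 1 ].lower() ] # lower() so that e.g. '.XLSX' is accepted too
        embeddings_dir = os.path.join( temp_dir, subdir, 'embeddings', '*.bin' ) # glob pattern matching the embedded objects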
    
        result = {}
        for bin_file in list( glob.glob( embeddings_dir ) ):
            result[ bin_file ] = bin_embedding_to_dictionary( bin_file )
    
        shutil.rmtree( temp_dir )
    
        return result
    
    
    def read_null_terminated( stream ):
        # Read single bytes up to (not including) the next null byte; also stop at the end of the stream
        data = b''
        while True:
            ch = stream.Read( 1 )
            if ch in ( b'\0', b'' ):
                return data
            data += ch
    
    
    def read_uint32_le( stream ):
        # Read a 4-byte unsigned integer stored in little endian (Python 3)
        return int.from_bytes( stream.Read( 4 ), 'little' )
    
    
    def bin_embedding_to_dictionary( bin_file ):
        # On Python 3 the stream returns bytes, so the text fields are decoded explicitly.
        # 'mbcs' is the ANSI code page of the current Windows machine; adjust it if the file came from elsewhere.
        mode = win32com.storagecon.STGM_READ | win32com.storagecon.STGM_SHARE_EXCLUSIVE
        storage = pythoncom.StgOpenStorage( bin_file, None, mode )
        for stastg in storage.EnumElements():
            if stastg[ 0 ] == '\1Ole10Native':
                stream = storage.OpenStream( stastg[ 0 ], None, mode )
    
                result = {}
                stream.Seek( 6, 0 ) # original filename in ANSI starts at byte 7 and is null terminated
                result[ 'original_filename' ] = read_null_terminated( stream ).decode( 'mbcs' )
                result[ 'original_filepath' ] = read_null_terminated( stream ).decode( 'mbcs' ) # original filepath in ANSI is next and is null terminated
    
                stream.Seek( 4, 1 ) # next 4 bytes is unused
    
                temporary_filepath_size = read_uint32_le( stream ) # size of the temporary file path in ANSI in little endian
                result[ 'temporary_filepath' ] = stream.Read( temporary_filepath_size ).rstrip( b'\0' ).decode( 'mbcs' ) # temporary file path in ANSI, trailing null removed
    
                result[ 'size' ] = read_uint32_le( stream ) # size of the contents in little endian
                result[ 'contents' ] = stream.Read( result[ 'size' ] ) # contents as raw bytes
    
                return result
    
        return None # no Ole10Native stream was found in this storage
    

    You can use it like this:

    objects = read_zipped_xml_bin_embeddings( dir_path + '\\test_excel.xlsx' )
    obj = list( objects.values() )[ 0 ] # Get first element, or iterate somehow, the keys are the temporary paths
    print( 'Original filename: ' + obj[ 'original_filename' ] )
    print( 'Original filepath: ' + obj[ 'original_filepath' ] )
    print( 'Temporary filepath: ' + obj[ 'temporary_filepath' ] )
    print( 'Contents: ' + repr( obj[ 'contents' ] ) ) # contents is raw binary data, so print its repr
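The byte layout that `bin_embedding_to_dictionary` walks through can also be tried out on a plain bytes buffer, which makes it testable without Excel or pywin32. The following is a minimal sketch under assumptions: `parse_ole10native` and `build_sample` are hypothetical helpers, the sample payload is synthetic rather than taken from a real workbook, the first 6 bytes are simply skipped as in the code above, and the default `cp1252` encoding is a guess at the ANSI code page.

```python
import struct


def parse_ole10native(data, encoding='cp1252'):
    # Parse the raw bytes of an Ole10Native stream; `encoding` is an assumed ANSI code page.
    def read_cstring(offset):
        end = data.index(b'\0', offset)  # position of the terminating null byte
        return data[offset:end].decode(encoding), end + 1

    offset = 6  # the filename starts at byte 7, as in the code above
    filename, offset = read_cstring(offset)
    filepath, offset = read_cstring(offset)
    offset += 4  # next 4 bytes are unused
    (temp_size,) = struct.unpack_from('<I', data, offset)  # little-endian uint32
    offset += 4
    temp_path = data[offset:offset + temp_size].rstrip(b'\0').decode(encoding)
    offset += temp_size
    (size,) = struct.unpack_from('<I', data, offset)
    offset += 4
    return {
        'original_filename': filename,
        'original_filepath': filepath,
        'temporary_filepath': temp_path,
        'size': size,
        'contents': data[offset:offset + size],
    }


def build_sample(filename, filepath, temp_path, contents):
    # Build a synthetic payload with the same field order (for demonstration only).
    body = (
        b'\x02\x00'
        + filename + b'\0'
        + filepath + b'\0'
        + b'\x00\x00\x03\x00'
        + struct.pack('<I', len(temp_path) + 1) + temp_path + b'\0'
        + struct.pack('<I', len(contents)) + contents
    )
    return struct.pack('<I', len(body)) + body


sample = build_sample(b'test.txt', b'C:\\data\\test.txt', b'C:\\Temp\\test.txt', b'hello')
parsed = parse_ole10native(sample)
print(parsed['original_filename'], parsed['size'], parsed['contents'])
```

Since 'contents' is raw binary data, writing it out with `open( target_path, 'wb' )` restores the embedded file regardless of its type.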
    
