openpyxl read tables from existing data book example?

Submitted by 巧了我就是萌 on 2020-08-24 07:03:50

Question


The openpyxl documentation has an example of how to place a table into a workbook, but no examples of how to retrieve the tables from an existing workbook. I have an XLS file that contains named tables, and I want to open the file, find all of the tables and parse them. I cannot find any documentation on how to do this. Can anyone help?

In the meantime I worked it out and wrote the following class to work with openpyxl:

class NamedArray(object):

    ''' Excel Named range object

        Reproduces the named range feature of Microsoft Excel
        Assumes a definition in the form <Worksheet PinList!$A$6:$A$52 provided by openpyxl
        Written for use with, and initialised by the get_names function
        After initialisation named array can be used in the same way as for VBA in excel
        Written for openpyxl version 2.4.1, may not work with earlier versions 
    '''

    C_CAPS = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'   

    def __init__(self, wb, named_range_raw):
        ''' Initialise a NamedArray object from the named_range_raw information in the given workbook

        '''
        self.sheet, cellrange_str = str(named_range_raw).split('!')
        self.sheet = self.sheet.replace("'",'') # remove the single quotes if they exist
        self.loc = wb[self.sheet]

        if ':' in cellrange_str:
            self.has_range = True
            self.has_value = False
            lo, hi = cellrange_str.split(':')
            self.ad_lo = lo.replace('$','')
            self.ad_hi = hi.replace('$','')
        else:
            self.has_range = False
            self.has_value = True
            self.ad_lo = cellrange_str.replace('$','')
            self.ad_hi = self.ad_lo

        self.row = self.get_row(self.ad_lo) 
        self.max_row = self.get_row(self.ad_hi)
        self.rows = self.max_row - self.row + 1
        self.min_col = self.col_to_n(self.ad_lo)
        self.max_col = self.col_to_n(self.ad_hi)
        self.cols    = self.max_col - self.min_col + 1


    def size_of(self):
        ''' Returns two dimensional size of named space
        '''
        return self.cols, self.rows 

    def value(self, row=1, col=1):
       ''' Returns the value at row, col
       '''
       assert row <= self.rows , 'invalid row number given'
       assert col <= self.cols , 'invalid column number given'
       return self.loc[self.n_to_col(self.min_col + col-1)+str(self.row + row-1)].value  # index by coordinate string


    def __str__(self):
        ''' printed description of named space
        '''
        locs = 's ' + self.ad_lo + ':' + self.ad_hi if self.has_range else ' ' + self.ad_lo
        return('named range'+ str(self.size_of()) + ' in sheet ' + self.sheet + ' @ location' + locs)  


    def __contains__(self, val):
        rval = False
        for row in range(1,self.rows+1):
            for col in range(1,self.cols+1):
                if self.value(row,col) == val:
                    rval = True
        return rval


    def vlookup(self, key, col):
        ''' excel style vlookup function
        '''
        assert col <= self.cols , 'invalid column number given'
        rval = None
        for row in range(1,self.rows+1):
            if self.value(row,1) == key:
                rval = self.value(row, col)
                break
        return rval


    def hlookup(self, key, row):
        ''' excel style hlookup function
        '''
        assert row <= self.rows , 'invalid row number given'
        rval = None
        for col in range(1,self.cols+1):
            if self.value(1,col) == key:
                rval = self.value(row, col)
                break
        return rval

    @classmethod
    def get_row(cls, ad):
        ''' get row number from cell string
        Cell string is assumed to be in excel format i.e "ABC123" where row is 123
        '''
        row = 0
        for l in ad:
            if l in "1234567890":
                row = row*10 + int(l)
        return row

    @classmethod
    def col_to_n(cls, ad):
        ''' find column number from xl address
            Cell string is assumed to be in excel format i.e "ABC123" where column is ABC
            column number is the 1-based integer representation i.e. A*26*26 + B*26 + C with A=1 ... Z=26
        '''
        n = 0
        for l in ad:
            if l in cls.C_CAPS:
                n = n*26 + cls.C_CAPS.find(l)+1
        return n

    @classmethod
    def n_to_col(cls, n):
        ''' make xl column address from column number
        '''
        ad = ''
        while n > 0:
            # shift to 0-based first: column letters have no zero digit (Z=26, AA=27)
            n, rem = divmod(n - 1, 26)
            ad = cls.C_CAPS[rem] + ad
        return ad



def get_names(workbook, filt='', debug=False):
    ''' Create a structure containing all of the names in the given workbook

        filt is an optional parameter and used to create a subset of names starting with filt
        useful for IO_ring_spreadsheet as all names start with 'n_'
        if present, filt characters are stripped off the front of the name
    '''
    named_ranges = workbook.defined_names.definedName
    name_list = {}

    for named_range in named_ranges:
        name = named_range.name
        if named_range.attr_text.startswith('#REF'):
            print('WARNING: named range "', name, '" is undefined')
        elif filt == '' or name.startswith(filt):
            name_list[name[len(filt):]] = NamedArray(workbook, named_range.attr_text)

    if debug:
        with open("H:\\names.txt",'w') as log:
            for item in name_list:
                print (item, '=', name_list[item])
                log.write(item.ljust(30) + ' = ' + str(name_list[item])+'\n')

    return name_list
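
The column letter arithmetic used by col_to_n and n_to_col is easy to get wrong at the Z/AA boundary, because Excel columns form a base-26 system with no zero digit. Below is a minimal standalone sketch of the round trip. The names letters_to_index and index_to_letters are illustrative only and are not part of openpyxl or of the class above.

```python
import string

LETTERS = string.ascii_uppercase


def letters_to_index(letters):
    """Convert column letters such as 'AB' to a 1-based column index."""
    n = 0
    for ch in letters.upper():
        n = n * 26 + LETTERS.index(ch) + 1
    return n


def index_to_letters(n):
    """Convert a 1-based column index back to column letters."""
    if n < 1:
        raise ValueError("column index must be >= 1")
    letters = ""
    while n > 0:
        # Subtract 1 before dividing so that 26 maps to 'Z' rather than 'AZ'.
        n, rem = divmod(n - 1, 26)
        letters = LETTERS[rem] + letters
    return letters


for col in ("A", "Z", "AA", "AZ", "ZZ", "AAA"):
    print(col, letters_to_index(col), index_to_letters(letters_to_index(col)))
```

openpyxl also ships openpyxl.utils.get_column_letter and openpyxl.utils.column_index_from_string, which cover the same conversion and may be preferable to hand-written helpers.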

Answer 1:


I agree that the documentation does not really help, and the public API also seems to offer only the add_table() method. But then I found openpyxl Issue 844, which asks for a better interface and shows that a worksheet has a _tables property.

This is enough to get a list of all tables in a file, together with some basic properties:

from openpyxl import load_workbook
wb = load_workbook(filename = 'test.xlsx')
for ws in wb.worksheets:
    print("Worksheet %s include %d tables:" % (ws.title, len(ws._tables)))
    for tbl in ws._tables:
        print(" : " + tbl.displayName)
        print("   -  name = " + tbl.name)
        print("   -  type = " + (tbl.tableType if isinstance(tbl.tableType, str) else 'n/a'))
        print("   - range = " + tbl.ref)
        print("   - #cols = %d" % len(tbl.tableColumns))
        for col in tbl.tableColumns:
            print("     : " + col.name)

Note that the if/else construct is required for tableType, since it can be None (for standard tables), and None cannot be concatenated with a str.
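
The guard on tableType is needed because + between a str and None raises a TypeError. Here is a small sketch of the failure and two ways around it, using a hypothetical table_type variable as a stand-in for tbl.tableType:

```python
table_type = None  # stand-in for tbl.tableType on a standard table

# Direct concatenation fails because None is not a str.
try:
    line = "   -  type = " + table_type
except TypeError:
    line = None

# Option 1: substitute a placeholder before concatenating.
# Note that `or` also replaces an empty string with the placeholder.
line_or = "   -  type = " + (table_type or "n/a")

# Option 2: let string formatting do the conversion (renders as 'None').
line_fmt = f"   -  type = {table_type}"

print(line_or)
print(line_fmt)
```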




Answer 2:


Building on @MichalKaut's answer, I created a simple function that returns a dictionary with all tables in a given workbook. It also puts each table's data into a Pandas DataFrame.

from openpyxl import load_workbook
import pandas as pd

def get_all_tables(filename):
    """ Get all tables from a given workbook. Returns a dictionary of tables. 
        Requires a filename, which includes the file path and filename. """
    
    # Load the workbook, from the filename
    wb = load_workbook(filename=filename, read_only=False, keep_vba=False, data_only=True, keep_links=False)

    # Initialize the dictionary of tables
    tables_dict = {}

    # Go through each worksheet in the workbook
    for ws_name in wb.sheetnames:
        print("")
        print(f"worksheet name: {ws_name}")
        ws = wb[ws_name]
        print(f"tables in worksheet: {len(ws.tables)}")

        # Get each table in the worksheet
        for tbl in ws.tables.values():
            print(f"table name: {tbl.name}")
            # First, add some info about the table to the dictionary
            tables_dict[tbl.name] = {
                    'table_name': tbl.name,
                    'worksheet': ws_name,
                    'num_cols': len(tbl.tableColumns),
                    'table_range': tbl.ref}

            # Grab the 'data' from the table
            data = ws[tbl.ref]

            # Now convert the table 'data' to a Pandas DataFrame
            # First get a list of all rows, including the first header row
            rows_list = []
            for row in data:
                # Get a list of all columns in each row
                cols = []
                for col in row:
                    cols.append(col.value)
                rows_list.append(cols)

            # Create a pandas dataframe from the rows_list. 
            # The first row is the column names
            df = pd.DataFrame(data=rows_list[1:], index=None, columns=rows_list[0])

            # Add the dataframe to the dictionary of tables
            tables_dict[tbl.name]['dataframe'] = df

    return tables_dict
            
# File location:
file = r"C:\Users\sean\spreadsheets\full_of_tables.xlsx"

# Run the function to return a dictionary of all tables in the Excel workbook
tables_dict = get_all_tables(filename=file)
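
The header-plus-rows step does not depend on pandas. As a minimal sketch, assuming the cell values have already been extracted into a list of lists in the way rows_list is built above, a hypothetical helper rows_to_records can turn them into dictionaries keyed by column name:

```python
def rows_to_records(rows):
    """Treat the first row as headers and return one dict per data row."""
    if not rows:
        return []
    header, *body = rows
    return [dict(zip(header, row)) for row in body]


# Sample values standing in for the cell values read from ws[tbl.ref]
sample = [
    ["item", "qty"],
    ["apple", 3],
    ["pear", 5],
]
records = rows_to_records(sample)
print(records)
```

A list of dictionaries like this can also be passed to pd.DataFrame if a DataFrame is wanted later.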



Answer 3:


The answer to this has changed.

Worksheet objects now expose a tables accessor, which acts as a dictionary. The updated answer is:

tmp = [ws.tables for ws in wb.worksheets]
tbls = [{v.name:v} for t in tmp for v in t.values()]
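
For a fuller picture, here is a hedged end-to-end sketch that assumes an openpyxl release recent enough to provide ws.tables. It builds a small workbook in memory (the sheet title, table name and cell values are made up for illustration), saves it to a buffer, reloads it, and collects every table into a single dictionary rather than a list of one-entry dictionaries.

```python
from io import BytesIO

from openpyxl import Workbook, load_workbook
from openpyxl.worksheet.table import Table

# Build a small workbook in memory so the example needs no file on disk.
wb_out = Workbook()
ws_out = wb_out.active
ws_out.title = "Data"
for row in (["item", "qty"], ["apple", 3], ["pear", 5]):
    ws_out.append(row)
ws_out.add_table(Table(displayName="Stock", ref="A1:B3"))

buffer = BytesIO()
wb_out.save(buffer)
buffer.seek(0)

# Read it back and collect every table, keyed by its display name.
wb = load_workbook(buffer)
found = {}
for ws in wb.worksheets:
    for tbl in ws.tables.values():
        # ws[tbl.ref] yields rows of cells covering the table range.
        values = [[cell.value for cell in row] for row in ws[tbl.ref]]
        found[tbl.displayName] = {"sheet": ws.title, "ref": tbl.ref, "rows": values}

print(found)
```

Keying one dictionary by table name makes lookups such as found["Stock"] direct. Table names are expected to be unique within a workbook, so collisions between sheets should not occur.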



Answer 4:


I'm not sure what you mean by parsing, but read support for worksheet tables has been available since version 2.4.4. If you have questions about the details, I suggest asking on the openpyxl mailing list, as that is a more suitable place for this kind of discussion.




Answer 5:


I don't think this is possible. It seems to work similarly to images: if you read and save a file with a table, the table will get stripped.



Source: https://stackoverflow.com/questions/43941365/openpyxl-read-tables-from-existing-data-book-example
