How to obtain sheet names from XLS files without loading the whole file?

独厮守ぢ 2020-11-29 22:36

I'm currently using pandas to read an Excel file and present its sheet names to the user, so they can select which sheet they would like to use. The problem is that the files a

6 answers
  •  日久生厌
    2020-11-29 23:02

    I have tried xlrd, pandas, openpyxl and other such libraries, and all of them seem to get dramatically slower as the file size increases, because they read the entire file. The other solutions mentioned above that use 'on_demand' did not work for me. The following function works for xlsx files.

    import os
    import shutil
    import zipfile

    import xmltodict
    # Assumption: settings.MEDIA_ROOT comes from Django; any writable directory would do.
    from django.conf import settings

    def get_sheet_details(file_path):
        sheets = []
        file_name = os.path.splitext(os.path.split(file_path)[-1])[0]
        # Make a temporary directory named after the file
        directory_to_extract_to = os.path.join(settings.MEDIA_ROOT, file_name)
        os.makedirs(directory_to_extract_to, exist_ok=True)

        try:
            # Extract the xlsx file, as it is just a zip file
            with zipfile.ZipFile(file_path, 'r') as zip_ref:
                zip_ref.extractall(directory_to_extract_to)

            # Open workbook.xml, which is very light and only holds metadata, and get the sheets from it
            path_to_workbook = os.path.join(directory_to_extract_to, 'xl', 'workbook.xml')
            with open(path_to_workbook, 'rb') as f:
                # force_list keeps 'sheet' a list even when the workbook has a single sheet
                dictionary = xmltodict.parse(f.read(), force_list=('sheet',))
            for sheet in dictionary['workbook']['sheets']['sheet']:
                # xmltodict prefixes attributes with '@' by default; fall back to the bare key
                sheet_details = {
                    'id': sheet.get('@sheetId', sheet.get('sheetId')),
                    'name': sheet.get('@name', sheet.get('name')),
                }
                sheets.append(sheet_details)
        finally:
            # Delete the extracted files directory, even if parsing failed
            shutil.rmtree(directory_to_extract_to)
        return sheets
    

    Since all xlsx files are basically zip archives, we extract the underlying XML data and read the sheet names directly from workbook.xml, which takes a fraction of a second compared with the library functions.
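
    The function above unpacks the whole archive to disk before reading one small file. As a minimal sketch of the same idea, the workbook part can probably be read straight out of the zip with only the standard library. The name list_sheet_names is hypothetical, and the sketch assumes the workbook part lives at 'xl/workbook.xml' and uses the usual (non-strict) SpreadsheetML namespace.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the main SpreadsheetML part (assumed: transitional, non-strict OOXML).
MAIN_NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"


def list_sheet_names(xlsx_file):
    """Return the sheet names of an .xlsx given a path or a file-like object.

    Only the small 'xl/workbook.xml' member is decompressed; the worksheet
    data itself is never read.
    """
    with zipfile.ZipFile(xlsx_file) as archive:
        with archive.open("xl/workbook.xml") as workbook_xml:
            root = ET.parse(workbook_xml).getroot()
    # Each <sheet> element carries its display name in the 'name' attribute.
    return [sheet.get("name") for sheet in root.iter(f"{{{MAIN_NS}}}sheet")]
```

    This avoids the temporary directory and the xmltodict dependency, at the cost of hard-coding the namespace; it will not work for the older binary .xls format, which is not a zip archive.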

    Benchmarking (on a 6 MB xlsx file with 4 sheets):
    pandas, xlrd: 12 seconds
    openpyxl: 24 seconds
    Proposed method: 0.4 seconds
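
    To reproduce a comparison like this on your own files, a small timing helper along these lines may be enough. It is only a sketch: time_call is a hypothetical name, and the numbers above were not produced with it.

```python
import time


def time_call(func, *args, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` calls of func(*args)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        best = min(best, time.perf_counter() - start)
    return best
```

    For example, time_call(get_sheet_details, path) with path pointing at a large workbook of your own gives a figure that can be compared against the same call wrapped around a pandas or openpyxl loader.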
