Looping through .xlsx files using pandas, only does first file

我的未来我决定 submitted on 2020-01-06 03:39:48

Question


My ultimate goal is to merge the contents of a folder full of .xlsx files into one big file.

I thought the code below would suffice, but it only processes the first file, and I can't figure out why it stops there. The files are small (~6 KB), so it shouldn't be a matter of waiting. If I print f_list, it shows the complete list of files. So where am I going wrong? To be clear, no error is returned; the loop just doesn't seem to process every file. I feel like there should be a simple fix, but being new to Python and coding, I'm having trouble seeing it.

I'm doing this with Anaconda on Windows 8.

import pandas as pd
import glob
f_list = glob.glob("C:\\Users\\me\\dt\\xx\\*.xlsx")  # creates my file list
all_data = pd.DataFrame()             # creates my DataFrame

for f in f_list:                      # loops over the file list, but only the first file ends up in the output
    df = pd.read_excel(f)             # reads .xlsx file
    all_data = all_data.append(df)    # appends file contents to DataFrame
all_data.to_excel("output.xlsx")      # creates new .xlsx

Edit with new information:

After trying some of the suggested changes, I noticed that the output reports the files as empty, except for one of them, which is slightly larger than the others. If I read them into the DataFrame, the DataFrame is reported as empty. If I read them into the dict, the keys have no values associated with them. Could this have something to do with the file size? Many, if not most, of these files have 3-5 rows with 5 columns. The one it does see has 12 rows.


Answer 1:


I strongly recommend reading the DataFrames into a dict:

sheets = {f: pd.read_excel(f) for f in f_list}

For one thing, this is very easy to debug: just inspect the dict in the REPL.
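As a rough sketch of what inspecting the dict could look like, the snippet below maps each file name to the shape of the DataFrame read from it and lists the ones that came back with no rows. The helper name `report_shapes` and the two stand-in DataFrames are hypothetical, used only so the example runs without any files on disk.

```python
import pandas as pd

def report_shapes(sheets):
    """Map each file name to the (rows, columns) shape of its DataFrame."""
    return {name: df.shape for name, df in sheets.items()}

# Stand-in data (hypothetical) so the sketch runs without files on disk.
# In real use: sheets = {f: pd.read_excel(f) for f in f_list}
sheets = {
    "small.xlsx": pd.DataFrame(),  # a file that read as empty
    "large.xlsx": pd.DataFrame({"a": range(12), "b": range(12)}),
}

shapes = report_shapes(sheets)
# A file counts as empty here when zero rows were read from it
empty_files = [name for name, shape in shapes.items() if shape[0] == 0]
print(shapes)
print("empty:", empty_files)
```

If a file shows up as empty, one thing that may be worth checking is whether its data sits on a sheet other than the first: pd.read_excel reads only the first sheet by default, and passing sheet_name=None returns a dict with every sheet in the workbook.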

Another benefit is that you can then concatenate these into one DataFrame efficiently in one pass:

pd.concat(sheets.values())

Note: This is significantly faster than append, which has to allocate a temporary DataFrame at each append-call.
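A minimal sketch of the dict-then-concat approach follows. It passes the dict itself to pd.concat so that the file names become an index level, then turns that level into an ordinary column. The helper name `combine_sheets`, the column name `source_file`, and the stand-in DataFrames are assumptions for illustration. Note also that DataFrame.append was removed in pandas 2.0, so on current versions pd.concat is the way to go regardless of speed.

```python
import pandas as pd

def combine_sheets(sheets):
    """Concatenate a {file name: DataFrame} dict into one DataFrame.

    The file name is kept in a 'source_file' column (a hypothetical name)
    so each row can be traced back to the workbook it came from.
    """
    # Passing the dict makes its keys the outer index level;
    # names= labels the two resulting index levels.
    combined = pd.concat(sheets, names=["source_file", "row"])
    # Move the file name into a column, then discard the per-file row index
    return combined.reset_index(level="source_file").reset_index(drop=True)

# Stand-in data (hypothetical) so the sketch runs without files on disk.
# In real use: sheets = {f: pd.read_excel(f) for f in f_list}
sheets = {
    "a.xlsx": pd.DataFrame({"x": [1, 2], "y": [3, 4]}),
    "b.xlsx": pd.DataFrame({"x": [5], "y": [6]}),
}
all_data = combine_sheets(sheets)
print(all_data)
```

Keeping the source file name around is optional, but it tends to make it obvious at a glance whether every file actually contributed rows.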


Another possible issue is that your glob may not be picking up all the files; you should check that it is by printing f_list.
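One way to sanity-check the glob, sketched below with the standard library only, is to list the matches and filter out the lock files that Excel creates next to an open workbook (their names start with `~$`). The helper name `list_xlsx` and the throwaway folder contents are hypothetical.

```python
import glob
import os
import tempfile

def list_xlsx(folder):
    """Return the sorted .xlsx paths in folder, skipping Excel lock files (~$...)."""
    paths = glob.glob(os.path.join(folder, "*.xlsx"))
    return sorted(p for p in paths if not os.path.basename(p).startswith("~$"))

# Demonstration in a temporary folder with empty placeholder files
with tempfile.TemporaryDirectory() as folder:
    for name in ("a.xlsx", "b.xlsx", "~$a.xlsx", "notes.txt"):
        open(os.path.join(folder, name), "w").close()
    found = [os.path.basename(p) for p in list_xlsx(folder)]
    print(found)
```

Building the pattern with os.path.join also avoids hand-escaping backslashes in Windows paths.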



Source: https://stackoverflow.com/questions/32831446/looping-through-xlsx-files-using-pandas-only-does-first-file
