Question
How can I import and read multiple CSV files in chunks when the total size of all the CSVs is around 20 GB?
I don't want to use Spark, because I want to fit a scikit-learn model afterwards, so I'd like a solution in Pandas itself.
My code is:
import glob
import os
import pandas as pd

allFiles = glob.glob(os.path.join(path, "*.csv"))
df = pd.concat((pd.read_csv(f, sep=",") for f in allFiles))
df.reset_index(drop=True, inplace=True)
But this is failing because the total size of all the CSVs in my path is 17 GB.
I want to read them in chunks, but I get an error if I try it like this:
allFiles = glob.glob(os.path.join(path, "*.csv"))
df = pd.concat((pd.read_csv(f,sep=",",chunksize=10000) for f in allFiles))
df.reset_index(drop=True, inplace=True)
The error I get is this:
"cannot concatenate object of type ""; only pd.Series, pd.DataFrame, and pd.Panel (deprecated) objs are valid"
Can someone help?
Answer 1:
To read a large CSV file you can use chunksize, but then you have to consume the result as an iterator, like this:
for df in pd.read_csv('file.csv', sep=',', iterator=True, chunksize=10000):
    process(df)
You then have to concat or append each chunk (see the sketch after the next snippet).
Or you could do it like this:
df = pd.read_csv('file.csv', sep=',', iterator=True, chunksize=10000)
for chunk in df:
    process(chunk)
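For example, a minimal sketch of the accumulate-and-concat pattern (the file name and chunk size are placeholders):

import pandas as pd

chunks = []
for chunk in pd.read_csv('file.csv', sep=',', chunksize=10000):
    chunks.append(chunk)  # optionally filter or transform each chunk first
df = pd.concat(chunks, ignore_index=True)  # one DataFrame built from all chunks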
To read multiple files, for example:
listfile = ['file1', 'file2']
dfx = pd.DataFrame()

def process(d):
    # e.g. accumulate: global dfx; dfx = pd.concat([dfx, d])
    # other processing
    pass

for f in listfile:
    for df in pd.read_csv(f, sep=',', iterator=True, chunksize=10000):
        process(df)
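Putting the pieces together, a rough sketch for the original question: read every CSV in a directory chunk by chunk and combine the results (the path and chunk size are assumptions, and the final concat still needs enough RAM to hold the full data):

import glob
import os

import pandas as pd

path = "your_csv_folder"  # assumed location of the CSV files
allFiles = glob.glob(os.path.join(path, "*.csv"))

chunks = []
for f in allFiles:
    # read_csv with chunksize yields DataFrame chunks instead of one big DataFrame
    for chunk in pd.read_csv(f, sep=',', chunksize=10000):
        chunks.append(chunk)

df = pd.concat(chunks, ignore_index=True)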
After that, if you have a lot of files you could use DASK or Pool from the multiprocessing library to launch many reading processes in parallel.
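A rough sketch of the Pool idea (the file pattern and worker count are placeholders; parallel reads of large files still multiply memory use):

import glob
from multiprocessing import Pool

import pandas as pd

def read_one(filename):
    # each worker process reads a single CSV into a DataFrame
    return pd.read_csv(filename, sep=',')

if __name__ == '__main__':
    files = glob.glob('*.csv')
    with Pool(processes=4) as pool:  # assumed worker count
        frames = pool.map(read_one, files)
    df = pd.concat(frames, ignore_index=True)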
Either way, you either need enough memory or you lose time.
Answer 2:
This is an interesting question. I haven't tried this, but I think the code would look something like the script below.
import glob
import os

import pandas as pd

# os.chdir("C:\\your_path\\")
filelist = glob.glob("C:\\your_path\\*.csv")
dfList = []
for filename in filelist:
    print(filename)
    namedf = pd.read_csv(filename, skiprows=0, index_col=0)
    dfList.append(namedf)
# DataFrame.append is deprecated/removed in recent pandas, so concatenate the list instead
results = pd.concat(dfList)
results.to_csv('C:\\your_path\\Combinefile.csv')

chunksize = 10 ** 6
for chunk in pd.read_csv('C:\\your_path\\Combinefile.csv', chunksize=chunksize):
    process(chunk)
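The process() function above is not defined; as one possible sketch of what it could do for the scikit-learn use case, an estimator that supports partial_fit can be trained chunk by chunk (the SGDClassifier, column names, and class labels here are made-up assumptions):

from sklearn.linear_model import SGDClassifier

model = SGDClassifier()
classes = [0, 1]  # assumed label set; partial_fit requires it on the first call

def process(chunk):
    # assumed layout: 'target' holds the label, all other columns are numeric features
    X = chunk.drop(columns=['target'])
    y = chunk['target']
    model.partial_fit(X, y, classes=classes)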
Maybe you could load everything into memory and process it directly, but it would probably take a lot longer to process everything.
Source: https://stackoverflow.com/questions/54987682/read-multiple-csv-files-in-pandas-in-chunks