Read multiple CSV files in Pandas in chunks

Submitted by 删除回忆录丶 on 2019-12-22 10:29:28

Question


How do I import and read multiple CSV files in chunks when the total size of all the files is around 20 GB?

I don't want to use Spark, as I want to use a model in scikit-learn, so I want the solution in Pandas itself.

My code is:

import glob, os
import pandas as pd

allFiles = glob.glob(os.path.join(path, "*.csv"))
df = pd.concat((pd.read_csv(f, sep=",") for f in allFiles))
df.reset_index(drop=True, inplace=True)

But this fails, as the total size of all the CSVs in my path is 17 GB.

I want to read them in chunks, but I get an error if I try this:

allFiles = glob.glob(os.path.join(path, "*.csv"))
df = pd.concat((pd.read_csv(f, sep=",", chunksize=10000) for f in allFiles))
df.reset_index(drop=True, inplace=True)

The error I get is this:

"cannot concatenate object of type ""; only pd.Series, pd.DataFrame, and pd.Panel (deprecated) objs are valid"

Can someone help?


Answer 1:


When you pass chunksize, pd.read_csv returns a TextFileReader iterator rather than a DataFrame, which is why pd.concat rejected it. To read a large CSV file you can still use chunksize, but you have to iterate over the chunks, like this:

for df in pd.read_csv('file.csv', sep=',', iterator=True, chunksize=10000):
    process(df)

Inside process you have to concat or append each chunk yourself; see the sketch below.
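For instance, a minimal sketch of that pattern (the final concat and the filter comment are illustrative, not from the original answer):

import pandas as pd

chunks = []
for chunk in pd.read_csv('file.csv', sep=',', chunksize=10000):
    # Optionally filter or aggregate here so only what you need stays in memory.
    chunks.append(chunk)

# A single concat at the end is much cheaper than growing a DataFrame in the loop.
df = pd.concat(chunks, ignore_index=True)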

Or you could do it like this:

df = pd.read_csv('file.csv', sep=',', iterator=True, chunksize=10000)
for chunk in df:
    process(chunk)

To read multiple files, for example:

listfile = ['file1', 'file2']
dfx = pd.DataFrame()

def process(d):
    global dfx
    dfx = pd.concat([dfx, d])  # or dfx = dfx.append(d), deprecated in newer pandas
    # other processing

for f in listfile:
    for df in pd.read_csv(f, sep=',', iterator=True, chunksize=10000):
        process(df)

Once you have a lot of files, you could use Dask or Pool from the multiprocessing library to launch many reading processes in parallel.

Either way: either you have enough memory, or you lose time.
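A minimal sketch of the Pool idea, assuming each individual file fits in memory on its own (the read_one helper and the path are illustrative):

import glob
import os
from multiprocessing import Pool

import pandas as pd

def read_one(path):
    # Runs in a worker process; reads one CSV into a DataFrame.
    return pd.read_csv(path, sep=',')

if __name__ == '__main__':
    allFiles = glob.glob(os.path.join('your_path', '*.csv'))
    with Pool(processes=4) as pool:
        frames = pool.map(read_one, allFiles)
    df = pd.concat(frames, ignore_index=True)

Note this parallelizes the reading but still materializes everything in memory at the end, so it helps with time, not with the 17 GB memory problem.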




Answer 2:


This is an interesting question. I haven't tried this, but I think the code would look something like the script below.

import glob
import os

import pandas as pd

#os.chdir("C:\\your_path\\")
filelist = glob.glob("C:\\your_path\\*.csv")
dfList = []
for filename in filelist:
    print(filename)
    dfList.append(pd.read_csv(filename, skiprows=0, index_col=0))
# DataFrame.append was removed in pandas 2.0; build a list and concat once instead.
results = pd.concat(dfList)

results.to_csv('C:\\your_path\\Combinefile.csv')

chunksize = 10 ** 6
for chunk in pd.read_csv('C:\\your_path\\Combinefile.csv', chunksize=chunksize):
    process(chunk)  # process is whatever per-chunk work you need

Maybe you could load everything into memory and process it directly, but it would probably take a lot longer to process everything.
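Since the question's goal is feeding a scikit-learn model, here is a hedged sketch of wiring the chunked loop to an out-of-core learner; the SGDClassifier choice, the 'target' column name, and the label set are assumptions, not from the original answer:

import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier

model = SGDClassifier()
classes = np.array([0, 1])  # assumed label set; partial_fit needs it on the first call

for chunk in pd.read_csv('C:\\your_path\\Combinefile.csv', chunksize=10 ** 6):
    X = chunk.drop(columns=['target'])  # 'target' is a hypothetical label column
    y = chunk['target']
    model.partial_fit(X, y, classes=classes)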



Source: https://stackoverflow.com/questions/54987682/read-multiple-csv-files-in-pandas-in-chunks
