I am a beginner with Python. I have multiple CSV files (more than 10), all of which have the same number of columns. I would like to merge all of them into a single CSV file.
While I think that the best answer is the one from @valentin, you can do this without using the csv module at all:
import glob

interesting_files = glob.glob("*.csv")

header_saved = False
with open('output.csv', 'w') as fout:  # 'wb' in Python 2
    for filename in interesting_files:
        with open(filename) as fin:
            header = next(fin)
            if not header_saved:
                fout.write(header)  # write the header only once
                header_saved = True
            for line in fin:
                fout.write(line)
Your attempt is almost working. Here's the corrected code, passing the csv.reader object directly to the writer.writerows method for shorter and faster code, and writing the header from the first file to the output file.
import glob
import csv

output_file = 'output.csv'
header_written = False

with open(output_file, 'w', newline='') as fout:  # just 'wb' in Python 2
    wout = csv.writer(fout, delimiter=',')
    # filter the output file itself out of the input list
    interesting_files = [x for x in glob.glob("*.csv") if x != output_file]
    for filename in interesting_files:
        print('Processing {}'.format(filename))
        with open(filename, newline='') as fin:
            cr = csv.reader(fin, delimiter=',')
            header = next(cr)  # read (and consume) the header
            if not header_written:
                wout.writerow(header)  # write it only once
                header_written = True
            wout.writerows(cr)  # copy all remaining rows in one call
Note that solutions using raw line-by-line processing miss an important point: if the header spans multiple physical lines (for example, a quoted field containing a newline), they fail miserably, botching the title line or repeating part of it several times and effectively corrupting the file. The csv module (and pandas, too) handles those cases gracefully.
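To see why, here is a minimal sketch (the file content is made up for illustration): the csv module treats a quoted field containing a newline as a single cell, whereas next(fin) on a raw file object only skips one physical line.

import csv
import io

# Hypothetical input: the first header field spans two physical lines.
data = '"multi\nline header",second\n1,2\n'

rows = list(csv.reader(io.StringIO(data)))
print(rows)       # [['multi\nline header', 'second'], ['1', '2']]
print(len(rows))  # 2 logical rows, although the data has 3 physical lines

# A raw next(fin) would skip only the first physical line, leaving the
# tail of the header ('line header",second') behind as if it were data.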
If you don't mind the overhead, you could use pandas, which is shipped with common Python distributions. If you plan to do more with spreadsheet-like tables, I recommend using pandas rather than trying to write your own libraries.
import pandas as pd
import glob

interesting_files = glob.glob("*.csv")

df_list = []
for filename in sorted(interesting_files):
    df_list.append(pd.read_csv(filename))

full_df = pd.concat(df_list)
full_df.to_csv('output.csv')
Just a little more on pandas. Because it is made to deal with spreadsheet-like data, it knows that the first line is a header. When reading a CSV it separates the data table from the header, which is kept as metadata of the dataframe, the standard datatype in pandas. If you concat several of these dataframes, it aligns the data parts by their headers: matching columns are stacked, and columns that only exist in some of the files are padded with NaN rather than silently merged, so a mismatch is easy to spot. Probably a good thing in case your directory is polluted with CSV files from another source.
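A minimal sketch of that alignment behaviour (the column names and values here are made up for illustration):

import pandas as pd

a = pd.DataFrame({'id': [1, 2], 'value': [10, 20]})
b = pd.DataFrame({'id': [3], 'value': [30]})
print(pd.concat([a, b]))  # rows stacked under the shared header

# A frame with a different header reveals itself immediately:
c = pd.DataFrame({'id': [4], 'other': [99]})
print(pd.concat([a, c]))  # 'value' and 'other' are padded with NaN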
Another thing: I just added sorted() around interesting_files. I assume your files are named in order and that this order should be kept. glob, like the underlying os functions, does not guarantee to return files sorted by name, so sort explicitly.
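A quick illustration of both points (the file names here are hypothetical):

import glob

# glob returns names in arbitrary filesystem order; sort explicitly.
print(sorted(glob.glob("*.csv")))

# Note that sorted() is lexicographic, not numeric:
print(sorted(["file2.csv", "file10.csv"]))  # ['file10.csv', 'file2.csv']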
If you are on a Linux system:

head -1 directory/one_file.csv > output.csv  ## write the header to the final file
tail -n +2 -q directory/*.csv >> output.csv  ## append every csv from its second line on; -q suppresses the "==> file <==" markers tail prints for multiple files
Your indentation is wrong: you need to put the loop inside the with block. You can also wrap each input file in a csv.reader and pass that object directly to writer.writerows.

import csv
import glob

with open('output.csv', 'w', newline='') as fout:  # 'wb' in Python 2
    wout = csv.writer(fout)
    interesting_files = glob.glob("*.csv")
    for filename in interesting_files:
        print('Processing', filename)
        with open(filename, newline='') as fin:
            cr = csv.reader(fin)
            next(cr)  # skip the header
            wout.writerows(cr)  # copy the remaining rows