Read multiple parquet files in a folder and write to single csv file using python

怎甘沉沦 submitted on 2019-12-01 12:56:59

If you are going to copy the files over to your local machine and run your code there, you could do something like this. The code below assumes that you are running it in the same directory as the parquet files. It also assumes the file naming you gave above: "par_file1, par_file2, par_file3 and so on, up to 100 files in a folder." If you need to search for your files instead, get the file names using glob and explicitly provide the path where you want to save the csv, e.g. open(r'this\is\your\path\to\csv_file.csv', 'a'). Hope this helps.
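
For example, a minimal sketch of that glob-based lookup (the wildcard path below is a placeholder, not a real location):

import glob

# Hypothetical folder; point this at wherever the parquet files actually live.
# glob returns names in arbitrary order, so sort if the file order matters.
files = sorted(glob.glob(r'this\is\your\path\to\par_file*.parquet'))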

import pandas as pd

# Write the first parquet file to csv_file.csv, including headers
with open('csv_file.csv','w') as csv_file:
    print('Reading par_file1.parquet')
    df = pd.read_parquet('par_file1.parquet')
    df.to_csv(csv_file, index=False)
    print('par_file1.parquet appended to csv_file.csv\n')
    # no explicit close() needed; the with block closes the file on exit

# build the remaining file names to look for in the current directory
files = [f'par_file{i}.parquet' for i in range(2, 101)]

# open files and append to csv_file.csv
for f in files:
    print(f'Reading {f}')
    df = pd.read_parquet(f)
    with open('csv_file.csv','a') as file:
        df.to_csv(file, header=False, index=False)
        print(f'{f} appended to csv_file.csv\n')

You can remove the print statements if you want.

Tested in Python 3.6 using pandas 0.23.3.

I ran into this question while looking to see whether pandas can natively read partitioned parquet datasets. I have to say that the current answer is unnecessarily verbose (making it difficult to parse). I also imagine that it's not particularly efficient to repeatedly open and close file handles, then scan to the end of the file before each append, depending on the file sizes.
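
As an aside, with the pyarrow engine installed, pd.read_parquet will also accept a directory path and read every parquet file under it in one call; a minimal sketch, assuming a recent pandas and pyarrow, with placeholder paths:

import pandas as pd

# Assumes the pyarrow engine, which treats a directory as a dataset
# and reads all of the parquet files inside it at once.
df = pd.read_parquet('dir/to/parquet/files', engine='pyarrow')
df.to_csv('csv_file.csv', index=False)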

A better alternative would be to read all the parquet files into a single DataFrame, and write it once:

from pathlib import Path
import pandas as pd

data_dir = Path('dir/to/parquet/files')
full_df = pd.concat(
    pd.read_parquet(parquet_file)
    for parquet_file in data_dir.glob('*.parquet')
)
full_df.to_csv('csv_file.csv')
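
One caveat: Path.glob yields files in arbitrary order, so if the row order of the output matters, sort the paths first (lexicographically here, which may differ from numeric order for names like par_file10):

full_df = pd.concat(
    pd.read_parquet(parquet_file)
    # sorted() pins down the arbitrary ordering that glob can return
    for parquet_file in sorted(data_dir.glob('*.parquet'))
)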

Alternatively, if you really want to just append to the file:

data_dir = Path('dir/to/parquet/files')
for i, parquet_path in enumerate(data_dir.glob('*.parquet')):
    df = pd.read_parquet(parquet_path)
    write_header = i == 0 # write header only on the 0th file
    write_mode = 'w' if i == 0 else 'a' # 'write' mode for 0th file, 'append' otherwise
    df.to_csv('csv_file.csv', mode=write_mode, header=write_header)

A final alternative for appending each file is to open the target CSV file in "a+" mode at the outset, keeping the file handle positioned at the end of the file for each write/append (I believe this works, but haven't actually tested it):

data_dir = Path('dir/to/parquet/files')
with open('csv_file.csv', 'a+') as csv_handle:
    for i, parquet_path in enumerate(data_dir.glob('*.parquet')):
        df = pd.read_parquet(parquet_path)
        write_header = i == 0 # write header only on the 0th file
        df.to_csv(csv_handle, header=write_header)