s3fs gzip compression on pandas dataframe

Submitted by 梦想与她 on 2020-01-03 16:48:12

Question


I'm trying to write a dataframe as a CSV file on S3 using the s3fs library and pandas. Despite what the documentation says, the gzip compression parameter does not seem to work with s3fs.

def DfTos3Csv(df, file):
    # fs is an s3fs.S3FileSystem instance
    with fs.open(file, 'wb') as f:
        df.to_csv(f, compression='gzip', index=False)

This code saves the dataframe as a new object in S3, but as plain CSV rather than in gzip format. On the other hand, the read functionality works fine with this compression parameter.

def s3CsvToDf(file):
    with fs.open(file) as f:
        df = pd.read_csv(f, compression='gzip')
    return df

Any suggestions or alternatives for the write issue? Thank you in advance!


Answer 1:


The compression parameter of to_csv() does not work when writing to an already-open file object or stream. You have to do the gzip compression and the upload separately, for example with boto3:

import gzip
import boto3
from io import BytesIO, TextIOWrapper

# Compress the CSV into an in-memory buffer
buffer = BytesIO()

with gzip.GzipFile(mode='w', fileobj=buffer) as zipped_file:
    df.to_csv(TextIOWrapper(zipped_file, 'utf8'), index=False)

# Upload the gzipped bytes to S3
s3_resource = boto3.resource('s3')
s3_object = s3_resource.Object('bucket_name', 'key')
s3_object.put(Body=buffer.getvalue())
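
If you would rather stay with s3fs (as in the question) instead of switching to boto3, another option along the same lines is to compress the CSV in memory and then write the raw gzip bytes through fs.open. This is only a minimal sketch, not part of the original answer: the function name df_to_s3_gzip and the bare S3FileSystem() setup are illustrative assumptions.

import gzip
import s3fs

fs = s3fs.S3FileSystem()  # assumption: default AWS credentials are available

def df_to_s3_gzip(df, path):
    # Render the CSV as text, gzip it in memory, then write the bytes to S3,
    # so to_csv() never has to apply compression to an open stream.
    csv_bytes = df.to_csv(index=False).encode('utf-8')
    with fs.open(path, 'wb') as f:
        f.write(gzip.compress(csv_bytes))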


Source: https://stackoverflow.com/questions/50350844/s3fs-gzip-compression-on-pandas-dataframe
