Pyarrow s3fs partition by timestamp


Question


Is it possible to use a timestamp field in a pyarrow table to partition the s3fs file system by "YYYY/MM/DD/HH" while writing a Parquet file to S3?


Answer 1:


I was able to achieve this with pyarrow's write_to_dataset function, which lets you specify partition columns that are used to create subdirectories.

Example:

import s3fs
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

access_key = "<access_key>"      # fill in your AWS access key ID
secret_key = "<secret_key>"      # fill in your AWS secret access key
bucket_name = "<bucket_name>"    # fill in your target S3 bucket

fs = s3fs.S3FileSystem(key=access_key, secret=secret_key)

bucket_uri = 's3://{0}/{1}'.format(bucket_name, "data")

data = {'date': ['2018-03-04T14:12:15.653Z', '2018-03-03T14:12:15.653Z',
                 '2018-03-02T14:12:15.653Z', '2018-03-05T14:12:15.653Z'],
        'battles': [34, 25, 26, 57],
        'citys': ['london', 'newyork', 'boston', 'boston']}
df = pd.DataFrame(data, columns=['date', 'battles', 'citys'])

# Parse the ISO-8601 strings, then derive the partition columns
df['date'] = pd.to_datetime(df['date'], format="%Y-%m-%dT%H:%M:%S.%fZ")
df['year'], df['month'], df['day'] = df['date'].dt.year, df['date'].dt.month, df['date'].dt.day

table = pa.Table.from_pandas(df)
pq.write_to_dataset(table, bucket_uri, filesystem=fs,
                    partition_cols=['year', 'month', 'day'],
                    use_dictionary=True, compression='snappy',
                    use_deprecated_int96_timestamps=True)
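
Note that write_to_dataset produces Hive-style key=value directories, e.g. s3://<bucket_name>/data/year=2018/month=3/day=4/..., rather than a literal YYYY/MM/DD path. To get the hourly granularity the question asks about, the same approach extends to an hour column. A minimal sketch, reusing the df, fs, and bucket_uri from above:

# Hypothetical extension: derive an hour column and include it in
# partition_cols for a year/month/day/hour layout.
df['hour'] = df['date'].dt.hour
table = pa.Table.from_pandas(df)
pq.write_to_dataset(table, bucket_uri, filesystem=fs,
                    partition_cols=['year', 'month', 'day', 'hour'])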

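To verify the result, the dataset can be read back through the same filesystem; a minimal sketch, assuming the same fs and bucket_uri:

# Read the partitioned dataset back; the partition columns are
# reconstructed from the directory names.
dataset = pq.ParquetDataset(bucket_uri, filesystem=fs)
print(dataset.read().to_pandas())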


Answer 2:


As far as I can tell: no.

pyarrow can read partitioned data, but there is nothing in it for writing partitions.

Several places document the writing functions, and none of them take partition options:

Can pyarrow write multiple parquet files to a folder like fastparquet's file_scheme='hive' option?

https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py#L941

https://issues.apache.org/jira/browse/ARROW-1858



Source: https://stackoverflow.com/questions/49085686/pyarrow-s3fs-partition-by-timestamp
