Question
Is it possible to use a timestamp field in a pyarrow table to partition the data on S3 by "YYYY/MM/DD/HH" while writing a Parquet file to S3 via s3fs?
Answer 1:
I was able to achieve this with pyarrow's write_to_dataset function, which lets you specify partition columns whose values become subdirectories.
Example:
import s3fs
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

access_key = '<access_key>'
secret_key = '<secret_key>'
bucket_name = '<bucket_name>'

fs = s3fs.S3FileSystem(key=access_key, secret=secret_key)
bucket_uri = 's3://{0}/{1}'.format(bucket_name, 'data')

data = {'date': ['2018-03-04T14:12:15.653Z', '2018-03-03T14:12:15.653Z',
                 '2018-03-02T14:12:15.653Z', '2018-03-05T14:12:15.653Z'],
        'battles': [34, 25, 26, 57],
        'citys': ['london', 'newyork', 'boston', 'boston']}
df = pd.DataFrame(data, columns=['date', 'battles', 'citys'])

# Parse the ISO 8601 strings and derive the partition columns.
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%dT%H:%M:%S.%fZ')
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day

# Each distinct combination of partition_cols becomes a subdirectory.
table = pa.Table.from_pandas(df)
pq.write_to_dataset(table, bucket_uri, filesystem=fs,
                    partition_cols=['year', 'month', 'day'],
                    use_dictionary=True, compression='snappy',
                    use_deprecated_int96_timestamps=True)
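The question asks for an hourly level as well. A minimal extension of the example above (my sketch, not part of the original answer) derives an hour column the same way and adds it to partition_cols. Note that write_to_dataset produces Hive-style keys, so the layout is year=YYYY/month=M/day=D/hour=H rather than bare YYYY/MM/DD/HH:

# Sketch: extend the example above down to the hour level.
df['hour'] = df['date'].dt.hour
table = pa.Table.from_pandas(df)
pq.write_to_dataset(table, bucket_uri, filesystem=fs,
                    partition_cols=['year', 'month', 'day', 'hour'],
                    compression='snappy',
                    use_deprecated_int96_timestamps=True)
# Resulting keys look like:
# s3://<bucket_name>/data/year=2018/month=3/day=4/hour=14/<uuid>.parquet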
Answer 2:
As far as I can tell: no.
pyarrow can READ partitioned data, but I found nothing related to writing it. Several places document the write functions, and none of them take partition options:
Can pyarrow write multiple parquet files to a folder like fastparquet's file_scheme='hive' option?
https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py#L941
https://issues.apache.org/jira/browse/ARROW-1858
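For completeness, reading partitioned data back does work. A minimal sketch, reusing the fs and bucket_uri names from Answer 1:

# Sketch: read a partitioned dataset back into a single table.
dataset = pq.ParquetDataset(bucket_uri, filesystem=fs)
df = dataset.read().to_pandas()  # partition columns come back as columns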
Source: https://stackoverflow.com/questions/49085686/pyarrow-s3fs-partition-by-timestamp