Is it possible to use a timestamp field in the pyarrow table to partition the s3fs file system by "YYYY/MM/DD/HH" while writing a parquet file to S3?
I was able to achieve this with pyarrow's write_to_dataset function, which lets you specify partition columns to create subdirectories.
Example:
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import s3fs

access_key = "<access_key>"
secret_key = "<secret_key>"
bucket_name = "<bucket_name>"

# s3fs file system handle that pyarrow writes through
fs = s3fs.S3FileSystem(key=access_key, secret=secret_key)
bucket_uri = 's3://{0}/{1}'.format(bucket_name, "data")

data = {'date': ['2018-03-04T14:12:15.653Z', '2018-03-03T14:12:15.653Z',
                 '2018-03-02T14:12:15.653Z', '2018-03-05T14:12:15.653Z'],
        'battles': [34, 25, 26, 57],
        'citys': ['london', 'newyork', 'boston', 'boston']}
df = pd.DataFrame(data, columns=['date', 'battles', 'citys'])

# Parse the ISO 8601 strings and derive the partition columns from the timestamp
df['date'] = pd.to_datetime(df['date'], format="%Y-%m-%dT%H:%M:%S.%fZ")
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day

table = pa.Table.from_pandas(df)

# Creates year=/month=/day= subdirectories under bucket_uri and writes one
# parquet file per partition
pq.write_to_dataset(table, bucket_uri, filesystem=fs,
                    partition_cols=['year', 'month', 'day'],
                    use_dictionary=True, compression='snappy',
                    use_deprecated_int96_timestamps=True)
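If hour-level granularity is needed (the "YYYY/MM/DD/HH" layout from the question), the same approach extends with one more derived column. A minimal sketch, assuming the df, fs and bucket_uri from the snippet above:

# Hypothetical extension: also partition on the hour of the timestamp
df['hour'] = df['date'].dt.hour
table = pa.Table.from_pandas(df)
pq.write_to_dataset(table, bucket_uri, filesystem=fs,
                    partition_cols=['year', 'month', 'day', 'hour'],
                    use_deprecated_int96_timestamps=True)

Note that write_to_dataset produces Hive-style key=value directory names (e.g. data/year=2018/month=3/day=4/hour=14/<file>.parquet) rather than a bare YYYY/MM/DD/HH path.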
As far as I can tell: no.
It can READ partitioned data, but there is nothing comparable for writing.
There are several places which document the writing functions, and none of them take partition options.
See also: Can pyarrow write multiple parquet files to a folder like fastparquet's file_scheme='hive' option?
https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py#L941
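On the reading side, a dataset partitioned this way can be loaded back through pq.ParquetDataset, which reconstructs the partition columns from the directory names. A minimal sketch, assuming the same fs and bucket_uri as in the question:

# Partition columns (year/month/day) come back as regular columns
dataset = pq.ParquetDataset(bucket_uri, filesystem=fs)
df_back = dataset.read().to_pandas()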
Source: https://stackoverflow.com/questions/49085686/pyarrow-s3fs-partition-by-timestamp