Question
Is it possible to use a timestamp field in a pyarrow table to partition the data on S3 by "YYYY/MM/DD/HH" while writing a Parquet file to S3 via s3fs?
Answer 1:
I was able to achieve this with pyarrow's write_to_dataset function, which lets you specify partition columns whose values become subdirectories.
Example:
import s3fs
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

access_key = '<access_key>'
secret_key = '<secret_key>'
bucket_name = '<bucket_name>'

fs = s3fs.S3FileSystem(key=access_key, secret=secret_key)
bucket_uri = 's3://{0}/{1}'.format(bucket_name, 'data')

data = {'date': ['2018-03-04T14:12:15.653Z', '2018-03-03T14:12:15.653Z',
                 '2018-03-02T14:12:15.653Z', '2018-03-05T14:12:15.653Z'],
        'battles': [34, 25, 26, 57],
        'citys': ['london', 'newyork', 'boston', 'boston']}
df = pd.DataFrame(data, columns=['date', 'battles', 'citys'])

# Parse the ISO 8601 strings and derive the partition columns.
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%dT%H:%M:%S.%fZ')
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day

# Each distinct combination of partition_cols becomes a subdirectory.
table = pa.Table.from_pandas(df)
pq.write_to_dataset(table, bucket_uri, filesystem=fs,
                    partition_cols=['year', 'month', 'day'],
                    use_dictionary=True, compression='snappy',
                    use_deprecated_int96_timestamps=True)
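The question asks for an hourly level as well. A minimal extension of the example above (my sketch, not part of the original answer) derives an hour column the same way and adds it to partition_cols. Note that write_to_dataset produces Hive-style keys, so the layout is year=YYYY/month=M/day=D/hour=H rather than bare YYYY/MM/DD/HH:

# Sketch: extend the example above down to the hour level.
df['hour'] = df['date'].dt.hour
table = pa.Table.from_pandas(df)
pq.write_to_dataset(table, bucket_uri, filesystem=fs,
                    partition_cols=['year', 'month', 'day', 'hour'],
                    compression='snappy',
                    use_deprecated_int96_timestamps=True)
# Resulting keys look like:
# s3://<bucket_name>/data/year=2018/month=3/day=4/hour=14/<uuid>.parquet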
Answer 2:
As far as I can tell: no.
pyarrow can READ partitioned data, but I found nothing related to writing it. Several places document the write functions, and none of them take partition options:
Can pyarrow write multiple parquet files to a folder like fastparquet's file_scheme='hive' option?
https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py#L941
https://issues.apache.org/jira/browse/ARROW-1858
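For completeness, reading partitioned data back does work. A minimal sketch, reusing the fs and bucket_uri names from Answer 1:

# Sketch: read a partitioned dataset back into a single table.
dataset = pq.ParquetDataset(bucket_uri, filesystem=fs)
df = dataset.read().to_pandas()  # partition columns come back as columns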
Source: https://stackoverflow.com/questions/49085686/pyarrow-s3fs-partition-by-timestamp