Question
I have a parquet dataset stored on S3, and I would like to query specific rows from the dataset. I was able to do that using petastorm, but now I want to do it using only pyarrow.
Here's my attempt:
import pyarrow.parquet as pq
import s3fs

fs = s3fs.S3FileSystem()
dataset = pq.ParquetDataset(
    'analytics.xxx',
    filesystem=fs,
    validate_schema=False,
    filters=[('event_name', '=', 'SomeEvent')]
)
df = dataset.read_pandas().to_pandas()
But that returns a pandas DataFrame as if the filter didn't work, i.e. I get rows with various values of event_name. Is there something I'm missing, or have I misunderstood something? I could filter after getting the pandas DataFrame, but that would use much more memory than needed.
Answer 1:
Currently, the filters functionality is only implemented at the file level, not yet at the row level.
So if you have a dataset as a collection of multiple, partitioned parquet files in a nested hierarchy (the type of partitioned dataset described here: https://arrow.apache.org/docs/python/parquet.html#partitioned-datasets-multiple-files), you can use the filters argument to read only a subset of the files, as in the sketch below.
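For example, if the dataset were written with event_name as a partition column (so the files live under directories like event_name=SomeEvent/), the same filters argument would prune whole files. A minimal sketch under that assumption, with a made-up bucket/prefix:

import pyarrow.parquet as pq
import s3fs

fs = s3fs.S3FileSystem()

# Hypothetical layout: analytics/events/event_name=SomeEvent/part-0.parquet, ...
dataset = pq.ParquetDataset(
    'analytics/events',
    filesystem=fs,
    filters=[('event_name', '=', 'SomeEvent')]  # prunes whole directories, i.e. files
)
df = dataset.read().to_pandas()  # only the matching partition's files are read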
But you can't yet use it to read only a subset of the row groups of a single file (see https://issues.apache.org/jira/browse/ARROW-1796).
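Until that lands, one workaround that keeps memory bounded is to read a file one row group at a time and filter each chunk before accumulating it, so only the matching rows are ever kept around. A sketch, assuming a single parquet file at a hypothetical path:

import pandas as pd
import pyarrow.parquet as pq
import s3fs

fs = s3fs.S3FileSystem()

# Hypothetical single-file path; adjust to your layout
with fs.open('analytics/events/part-0.parquet', 'rb') as f:
    pf = pq.ParquetFile(f)
    chunks = []
    for i in range(pf.num_row_groups):
        # Read one row group, convert it, and keep only the matching rows
        chunk = pf.read_row_group(i).to_pandas()
        chunks.append(chunk[chunk['event_name'] == 'SomeEvent'])
df = pd.concat(chunks, ignore_index=True)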
It would be nice, though, to get an error message when specifying such an invalid filter. I opened an issue for that: https://issues.apache.org/jira/browse/ARROW-5572
Source: https://stackoverflow.com/questions/56522977/using-predicates-to-filter-rows-from-pyarrow-parquet-parquetdataset