pyarrow

Can pyarrow write multiple parquet files to a folder like fastparquet's file_scheme='hive' option?

Submitted by 天大地大妈咪最大 on 2019-12-10 17:53:16
Question: I have a multi-million-record SQL table that I'm planning to write out to many Parquet files in a folder using the pyarrow library. The data seems too large to store in a single Parquet file. However, I can't find an API or parameter in pyarrow that lets me specify something like file_scheme="hive", as supported by the fastparquet Python library. Here's my sample code:

#!/usr/bin/python
import pyodbc
import pandas as pd
import pyarrow as pa
import
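
For reference, the pyarrow counterpart of fastparquet's file_scheme='hive' appears to be pyarrow.parquet.write_to_dataset, which splits a table into files under hive-style subdirectories. A minimal sketch, with hypothetical column names and an output path of my own choosing:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Toy data; 'region' stands in for whatever column you want to split on.
df = pd.DataFrame({'region': ['us', 'us', 'eu'], 'value': [1, 2, 3]})
table = pa.Table.from_pandas(df)

# Writes out_dir/region=us/<uuid>.parquet, out_dir/region=eu/<uuid>.parquet, ...
pq.write_to_dataset(table, root_path='out_dir', partition_cols=['region'])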

Using predicates to filter rows from pyarrow.parquet.ParquetDataset

Submitted by 让人想犯罪 __ on 2019-12-10 11:54:44
Question: I have a Parquet dataset stored on S3, and I would like to query specific rows from it. I was able to do that using petastorm, but now I want to do it using only pyarrow. Here's my attempt:

import pyarrow.parquet as pq
import s3fs

fs = s3fs.S3FileSystem()
dataset = pq.ParquetDataset(
    'analytics.xxx',
    filesystem=fs,
    validate_schema=False,
    filters=[('event_name', '=', 'SomeEvent')]
)
df = dataset.read_pandas().to_pandas()

But that returns a pandas DataFrame as if the filter didn't
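
One plausible explanation, worth verifying against your pyarrow version: in the older ParquetDataset implementation, filters are only applied to hive partition directory names, so a predicate on a regular column such as event_name is silently ignored unless the dataset is partitioned by it. Newer pyarrow releases add row-level filtering through the pyarrow.dataset API; a hedged sketch of that route, reusing the same hypothetical bucket path:

import pyarrow.dataset as ds
import s3fs

fs = s3fs.S3FileSystem()
# Recent pyarrow versions accept an fsspec filesystem such as s3fs here.
dataset = ds.dataset('analytics.xxx', filesystem=fs, format='parquet')
table = dataset.to_table(filter=ds.field('event_name') == 'SomeEvent')
df = table.to_pandas()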

Read Parquet files from HDFS using PyArrow

Submitted by 情到浓时终转凉″ on 2019-12-09 20:42:51
Question: I know I can connect to an HDFS cluster via pyarrow using pyarrow.hdfs.connect(). I also know I can read a Parquet file using pyarrow.parquet's read_table(). However, read_table() accepts a file path, whereas hdfs.connect() gives me a HadoopFileSystem instance. Is it somehow possible to use just pyarrow (with libhdfs3 installed) to get hold of a Parquet file/folder residing in an HDFS cluster? What I want to end up with is the to_pydict() function, so I can pass the data along.

Answer 1: Try fs = pa
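
For what it's worth, the HadoopFileSystem returned by pa.hdfs.connect() exposes a read_parquet() helper that yields a pyarrow Table directly, which then gives you to_pydict(). A minimal sketch, assuming a reachable namenode and a hypothetical path:

import pyarrow as pa

fs = pa.hdfs.connect(host='namenode', port=8020)   # needs libhdfs/libhdfs3 available
table = fs.read_parquet('/path/to/data.parquet')   # returns a pyarrow.Table
records = table.to_pydict()                        # column name -> list of values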

Streaming a Parquet file in Python and only downsampling

Submitted by 回眸只為那壹抹淺笑 on 2019-12-07 16:07:25
Question: I have data in Parquet format that is too big to fit into memory (6 GB). I am looking for a way to read and process the file using Python 3.6. Is there a way to stream the file, down-sample it, and save the result to a dataframe? Ultimately, I would like to have the data in dataframe format to work with. Am I wrong to attempt this without using a Spark framework? I have tried pyarrow and fastparquet, but I get memory errors when trying to read the entire file in. Any tips or suggestions would be
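
One way to keep memory bounded without Spark is to read the file one row group at a time and down-sample each chunk before moving on. A minimal sketch, assuming a hypothetical file path and a 1% random sample; peak memory stays near the size of a single row group:

import pandas as pd
import pyarrow.parquet as pq

pf = pq.ParquetFile('big.parquet')
samples = []
for i in range(pf.num_row_groups):
    chunk = pf.read_row_group(i).to_pandas()   # one row group in memory at a time
    samples.append(chunk.sample(frac=0.01))    # keep a 1% random sample
df = pd.concat(samples, ignore_index=True)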

Streaming a Parquet file in Python and only downsampling

Submitted by ↘锁芯ラ on 2019-12-05 21:53:00
I have data in Parquet format that is too big to fit into memory (6 GB). I am looking for a way to read and process the file using Python 3.6. Is there a way to stream the file, down-sample it, and save the result to a dataframe? Ultimately, I would like to have the data in dataframe format to work with. Am I wrong to attempt this without using a Spark framework? I have tried pyarrow and fastparquet, but I get memory errors when trying to read the entire file in. Any tips or suggestions would be greatly appreciated!

Answer: Spark is certainly a viable choice for this task. We're planning to add streaming
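
As an alternative to reading whole row groups, newer pyarrow releases (roughly 3.0 and later) can stream fixed-size record batches, which gives finer control over the memory high-water mark. A hedged sketch along the same down-sampling lines, with a hypothetical file path:

import pandas as pd
import pyarrow.parquet as pq

pf = pq.ParquetFile('big.parquet')
kept = []
for batch in pf.iter_batches(batch_size=100_000):   # stream ~100k rows at a time
    chunk = batch.to_pandas()
    kept.append(chunk.sample(frac=0.01))             # down-sample each batch
df = pd.concat(kept, ignore_index=True)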

Read Parquet files from HDFS using PyArrow

Submitted by 我的梦境 on 2019-12-04 16:04:26
I know I can connect to an HDFS cluster via pyarrow using pyarrow.hdfs.connect(). I also know I can read a Parquet file using pyarrow.parquet's read_table(). However, read_table() accepts a file path, whereas hdfs.connect() gives me a HadoopFileSystem instance. Is it somehow possible to use just pyarrow (with libhdfs3 installed) to get hold of a Parquet file/folder residing in an HDFS cluster? What I want to end up with is the to_pydict() function, so I can pass the data along.

Answer: Try

fs = pa.hdfs.connect(...)
fs.read_parquet('/path/to/hdfs-file', **other_options)

or

import pyarrow.parquet as pq
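
The answer is cut off after that second import, but a common variant of the second option is to open a file handle on HDFS and hand it to pyarrow.parquet, since read_table() also accepts file objects. A sketch under that assumption, with a hypothetical path:

import pyarrow as pa
import pyarrow.parquet as pq

fs = pa.hdfs.connect()
with fs.open('/path/to/hdfs-file') as f:   # file handle on HDFS
    table = pq.read_table(f)               # read_table accepts file objects too
print(table.to_pydict())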

How to set/get Pandas dataframes into Redis using pyarrow

Submitted by 心已入冬 on 2019-12-04 09:55:21
Using

dd = {'ID': ['H576', 'H577', 'H578', 'H600', 'H700'],
      'CD': ['AAAAAAA', 'BBBBB', 'CCCCCC', 'DDDDDD', 'EEEEEEE']}
df = pd.DataFrame(dd)

Before pandas 0.25, the following worked.

set: redisConn.set("key", df.to_msgpack(compress='zlib'))
get: pd.read_msgpack(redisConn.get("key"))

Now there are deprecation warnings:

FutureWarning: to_msgpack is deprecated and will be removed in a future version. It is recommended to use pyarrow for on-the-wire transmission of pandas objects.
The read_msgpack is deprecated and will be removed in a future version. It is recommended to use pyarrow for on-the-wire
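
Answers from that era often reached for pa.serialize/pa.deserialize, which later pyarrow versions deprecate; the Arrow IPC stream format is the longer-lived route and round-trips through Redis as raw bytes. A minimal sketch, assuming a Redis server on localhost:

import pandas as pd
import pyarrow as pa
import redis

r = redis.Redis()
df = pd.DataFrame({'ID': ['H576', 'H577', 'H578'], 'CD': ['AAAAAAA', 'BBBBB', 'CCCCCC']})

# set: serialize the DataFrame to Arrow IPC stream bytes
table = pa.Table.from_pandas(df)
sink = pa.BufferOutputStream()
writer = pa.ipc.new_stream(sink, table.schema)
writer.write_table(table)
writer.close()
r.set('key', sink.getvalue().to_pybytes())

# get: deserialize the bytes back into a DataFrame
restored = pa.ipc.open_stream(r.get('key')).read_all().to_pandas()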

How to write Parquet metadata with pyarrow?

Submitted by 丶灬走出姿态 on 2019-12-04 03:07:28
Question: I use pyarrow to create and analyse Parquet tables with biological information, and I need to store some metadata, e.g. which sample the data comes from and how it was obtained and processed. Parquet seems to support file-wide metadata, but I cannot find how to write it via pyarrow. The closest thing I could find is how to write row-group metadata, but that seems like overkill, since my metadata is the same for all row groups in the file. Is there any way to write file-wide Parquet metadata
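
One approach that should cover this: file-wide key/value metadata lives on the Arrow schema, so it can be merged in with Table.replace_schema_metadata() before writing. A minimal sketch with hypothetical keys and file names; merging preserves the pandas metadata pyarrow adds on its own:

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({'gene': ['BRCA1', 'TP53'], 'count': [12, 7]})

custom = {b'sample_id': b'S-042', b'pipeline': b'v1.3'}
merged = {**(table.schema.metadata or {}), **custom}   # keep existing metadata
table = table.replace_schema_metadata(merged)

pq.write_table(table, 'annotated.parquet')
print(pq.read_schema('annotated.parquet').metadata)    # the custom keys round-trip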

Pyarrow s3fs partition by timestamp

Submitted by 人盡茶涼 on 2019-12-03 15:39:29
Is it possible to use a timestamp field in the pyarrow table to partition the s3fs file system by "YYYY/MM/DD/HH" while writing a Parquet file to S3?

Answer: I was able to achieve this with pyarrow's write_to_dataset function, which allows you to specify partition columns to create subdirectories. Example:

import os
import s3fs
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow.filesystem import S3FSWrapper

access_key = <access_key>
secret_key = <secret_key>
bucket_name = <bucket_name>
fs = s3fs.S3FileSystem(key=access_key, secret=secret_key)
bucket_uri = 's3://{0}/{1}'
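
The example is cut off, but the usual pattern is to derive year/month/day/hour columns from the timestamp and pass them as partition_cols. A sketch of that idea with hypothetical column names, writing to a local directory for brevity instead of the S3 filesystem wrapper:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({'ts': pd.to_datetime(['2019-12-03 15:00', '2019-12-03 16:30']),
                   'value': [1, 2]})
df['year'] = df['ts'].dt.year
df['month'] = df['ts'].dt.month
df['day'] = df['ts'].dt.day
df['hour'] = df['ts'].dt.hour

table = pa.Table.from_pandas(df)
# Produces out/year=2019/month=12/day=3/hour=15/... and .../hour=16/...
pq.write_to_dataset(table, root_path='out',
                    partition_cols=['year', 'month', 'day', 'hour'])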

Read Parquet file stored in S3 with AWS Lambda (Python 3)

Submitted by 99封情书 on 2019-12-03 15:02:45
I am trying to load, process, and write Parquet files in S3 with AWS Lambda. My testing/deployment process is:

- https://github.com/lambci/docker-lambda as a container to mock the Amazon environment, because of the native libraries that need to be installed (numpy amongst others).
- This procedure to generate a zip file: http://docs.aws.amazon.com/lambda/latest/dg/with-s3-example-deployment-pkg.html#with-s3-example-deployment-pkg-python
- Add a test Python function to the zip, send it to S3, update the Lambda, and test it.

It seems that there are two possible approaches, which both work locally to
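
For context, a sketch of the kind of handler such a deployment would run, assuming hypothetical bucket/key names and that pyarrow, pandas, and s3fs are bundled into the package:

import pyarrow as pa
import pyarrow.parquet as pq
import s3fs

fs = s3fs.S3FileSystem()   # picks up the Lambda role's credentials

def handler(event, context):
    # Read the Parquet object from S3 into a pandas DataFrame.
    dataset = pq.ParquetDataset('my-bucket/input/data.parquet', filesystem=fs)
    df = dataset.read_pandas().to_pandas()

    processed = df.describe()   # placeholder for real processing

    # Write the result back to S3 as Parquet through the same filesystem.
    with fs.open('my-bucket/output/summary.parquet', 'wb') as f:
        pq.write_table(pa.Table.from_pandas(processed), f)
    return {'rows': len(df)}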