pyarrow

Can pyarrow write multiple parquet files to a folder like fastparquet's file_scheme='hive' option?

Submitted by 天大地大妈咪最大 on 2019-12-10 17:53:16
Question: I have a multi-million-record SQL table that I'm planning to write out to many Parquet files in a folder using the pyarrow library. The data seems too large to store in a single Parquet file. However, I can't find an API or parameter in pyarrow that lets me specify something like file_scheme="hive", as supported by the fastparquet Python library. Here's my sample code:

#!/usr/bin/python
import pyodbc
import pandas as pd
import pyarrow as pa
import
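
For reference, the pyarrow counterpart of fastparquet's file_scheme='hive' appears to be pyarrow.parquet.write_to_dataset, which splits a table into files under hive-style subdirectories. A minimal sketch, with hypothetical column names and an output path of my own choosing:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Toy data; 'region' stands in for whatever column you want to split on.
df = pd.DataFrame({'region': ['us', 'us', 'eu'], 'value': [1, 2, 3]})
table = pa.Table.from_pandas(df)

# Writes out_dir/region=us/<uuid>.parquet, out_dir/region=eu/<uuid>.parquet, ...
pq.write_to_dataset(table, root_path='out_dir', partition_cols=['region'])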

Using predicates to filter rows from pyarrow.parquet.ParquetDataset

Submitted by 让人想犯罪 __ on 2019-12-10 11:54:44
Question: I have a Parquet dataset stored on S3, and I would like to query specific rows from it. I was able to do that using petastorm, but now I want to do it using only pyarrow. Here's my attempt:

import pyarrow.parquet as pq
import s3fs

fs = s3fs.S3FileSystem()
dataset = pq.ParquetDataset(
    'analytics.xxx',
    filesystem=fs,
    validate_schema=False,
    filters=[('event_name', '=', 'SomeEvent')]
)
df = dataset.read_pandas().to_pandas()

But that returns a pandas DataFrame as if the filter didn't
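
One plausible explanation, worth verifying against your pyarrow version: in the older ParquetDataset implementation, filters are only applied to hive partition directory names, so a predicate on a regular column such as event_name is silently ignored unless the dataset is partitioned by it. Newer pyarrow releases add row-level filtering through the pyarrow.dataset API; a hedged sketch of that route, reusing the same hypothetical bucket path:

import pyarrow.dataset as ds
import s3fs

fs = s3fs.S3FileSystem()
# Recent pyarrow versions accept an fsspec filesystem such as s3fs here.
dataset = ds.dataset('analytics.xxx', filesystem=fs, format='parquet')
table = dataset.to_table(filter=ds.field('event_name') == 'SomeEvent')
df = table.to_pandas()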

Read Parquet files from HDFS using PyArrow

Submitted by 情到浓时终转凉″ on 2019-12-09 20:42:51
Question: I know I can connect to an HDFS cluster via pyarrow using pyarrow.hdfs.connect(). I also know I can read a Parquet file using pyarrow.parquet's read_table(). However, read_table() accepts a file path, whereas hdfs.connect() gives me a HadoopFileSystem instance. Is it somehow possible to use just pyarrow (with libhdfs3 installed) to get hold of a Parquet file/folder residing in an HDFS cluster? What I want to end up with is the to_pydict() function, so I can pass the data along.

Answer 1: Try fs = pa
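
For what it's worth, the HadoopFileSystem returned by pa.hdfs.connect() exposes a read_parquet() helper that yields a pyarrow Table directly, which then gives you to_pydict(). A minimal sketch, assuming a reachable namenode and a hypothetical path:

import pyarrow as pa

fs = pa.hdfs.connect(host='namenode', port=8020)   # needs libhdfs/libhdfs3 available
table = fs.read_parquet('/path/to/data.parquet')   # returns a pyarrow.Table
records = table.to_pydict()                        # column name -> list of values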

Streaming a Parquet file in Python and only downsampling

Submitted by 回眸只為那壹抹淺笑 on 2019-12-07 16:07:25
Question: I have data in Parquet format that is too big to fit into memory (6 GB). I am looking for a way to read and process the file using Python 3.6. Is there a way to stream the file, down-sample it, and save the result to a dataframe? Ultimately, I would like to have the data in dataframe format to work with. Am I wrong to attempt this without using a Spark framework? I have tried pyarrow and fastparquet, but I get memory errors when trying to read the entire file in. Any tips or suggestions would be
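
One way to keep memory bounded without Spark is to read the file one row group at a time and down-sample each chunk before moving on. A minimal sketch, assuming a hypothetical file path and a 1% random sample; peak memory stays near the size of a single row group:

import pandas as pd
import pyarrow.parquet as pq

pf = pq.ParquetFile('big.parquet')
samples = []
for i in range(pf.num_row_groups):
    chunk = pf.read_row_group(i).to_pandas()   # one row group in memory at a time
    samples.append(chunk.sample(frac=0.01))    # keep a 1% random sample
df = pd.concat(samples, ignore_index=True)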

Streaming a Parquet file in Python and only downsampling

Submitted by ↘锁芯ラ on 2019-12-05 21:53:00
I have data in Parquet format that is too big to fit into memory (6 GB). I am looking for a way to read and process the file using Python 3.6. Is there a way to stream the file, down-sample it, and save the result to a dataframe? Ultimately, I would like to have the data in dataframe format to work with. Am I wrong to attempt this without using a Spark framework? I have tried pyarrow and fastparquet, but I get memory errors when trying to read the entire file in. Any tips or suggestions would be greatly appreciated!

Answer: Spark is certainly a viable choice for this task. We're planning to add streaming
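
As an alternative to reading whole row groups, newer pyarrow releases (roughly 3.0 and later) can stream fixed-size record batches, which gives finer control over the memory high-water mark. A hedged sketch along the same down-sampling lines, with a hypothetical file path:

import pandas as pd
import pyarrow.parquet as pq

pf = pq.ParquetFile('big.parquet')
kept = []
for batch in pf.iter_batches(batch_size=100_000):   # stream ~100k rows at a time
    chunk = batch.to_pandas()
    kept.append(chunk.sample(frac=0.01))             # down-sample each batch
df = pd.concat(kept, ignore_index=True)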

Read Parquet files from HDFS using PyArrow

Submitted by 我的梦境 on 2019-12-04 16:04:26
I know I can connect to an HDFS cluster via pyarrow using pyarrow.hdfs.connect(). I also know I can read a Parquet file using pyarrow.parquet's read_table(). However, read_table() accepts a file path, whereas hdfs.connect() gives me a HadoopFileSystem instance. Is it somehow possible to use just pyarrow (with libhdfs3 installed) to get hold of a Parquet file/folder residing in an HDFS cluster? What I want to end up with is the to_pydict() function, so I can pass the data along.

Answer: Try

fs = pa.hdfs.connect(...)
fs.read_parquet('/path/to/hdfs-file', **other_options)

or

import pyarrow.parquet as pq
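
The answer is cut off after that second import, but a common variant of the second option is to open a file handle on HDFS and hand it to pyarrow.parquet, since read_table() also accepts file objects. A sketch under that assumption, with a hypothetical path:

import pyarrow as pa
import pyarrow.parquet as pq

fs = pa.hdfs.connect()
with fs.open('/path/to/hdfs-file') as f:   # file handle on HDFS
    table = pq.read_table(f)               # read_table accepts file objects too
print(table.to_pydict())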

How to set/get Pandas dataframes into Redis using pyarrow

Submitted by 心已入冬 on 2019-12-04 09:55:21
Using

dd = {'ID': ['H576', 'H577', 'H578', 'H600', 'H700'],
      'CD': ['AAAAAAA', 'BBBBB', 'CCCCCC', 'DDDDDD', 'EEEEEEE']}
df = pd.DataFrame(dd)

Before pandas 0.25, the following worked.

set: redisConn.set("key", df.to_msgpack(compress='zlib'))
get: pd.read_msgpack(redisConn.get("key"))

Now there are deprecation warnings:

FutureWarning: to_msgpack is deprecated and will be removed in a future version. It is recommended to use pyarrow for on-the-wire transmission of pandas objects.
The read_msgpack is deprecated and will be removed in a future version. It is recommended to use pyarrow for on-the-wire
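
Answers from that era often reached for pa.serialize/pa.deserialize, which later pyarrow versions deprecate; the Arrow IPC stream format is the longer-lived route and round-trips through Redis as raw bytes. A minimal sketch, assuming a Redis server on localhost:

import pandas as pd
import pyarrow as pa
import redis

r = redis.Redis()
df = pd.DataFrame({'ID': ['H576', 'H577', 'H578'], 'CD': ['AAAAAAA', 'BBBBB', 'CCCCCC']})

# set: serialize the DataFrame to Arrow IPC stream bytes
table = pa.Table.from_pandas(df)
sink = pa.BufferOutputStream()
writer = pa.ipc.new_stream(sink, table.schema)
writer.write_table(table)
writer.close()
r.set('key', sink.getvalue().to_pybytes())

# get: deserialize the bytes back into a DataFrame
restored = pa.ipc.open_stream(r.get('key')).read_all().to_pandas()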

How to write Parquet metadata with pyarrow?

Submitted by 丶灬走出姿态 on 2019-12-04 03:07:28
Question: I use pyarrow to create and analyse Parquet tables with biological information, and I need to store some metadata, e.g. which sample the data comes from and how it was obtained and processed. Parquet seems to support file-wide metadata, but I cannot find how to write it via pyarrow. The closest thing I could find is how to write row-group metadata, but that seems like overkill, since my metadata is the same for all row groups in the file. Is there any way to write file-wide Parquet metadata
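
One approach that should cover this: file-wide key/value metadata lives on the Arrow schema, so it can be merged in with Table.replace_schema_metadata() before writing. A minimal sketch with hypothetical keys and file names; merging preserves the pandas metadata pyarrow adds on its own:

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({'gene': ['BRCA1', 'TP53'], 'count': [12, 7]})

custom = {b'sample_id': b'S-042', b'pipeline': b'v1.3'}
merged = {**(table.schema.metadata or {}), **custom}   # keep existing metadata
table = table.replace_schema_metadata(merged)

pq.write_table(table, 'annotated.parquet')
print(pq.read_schema('annotated.parquet').metadata)    # the custom keys round-trip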

Pyarrow s3fs partition by timestamp

Submitted by 人盡茶涼 on 2019-12-03 15:39:29
Is it possible to use a timestamp field in the pyarrow table to partition the s3fs file system by "YYYY/MM/DD/HH" while writing a Parquet file to S3?

Answer: I was able to achieve this with pyarrow's write_to_dataset function, which allows you to specify partition columns to create subdirectories. Example:

import os
import s3fs
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow.filesystem import S3FSWrapper

access_key = <access_key>
secret_key = <secret_key>
bucket_name = <bucket_name>
fs = s3fs.S3FileSystem(key=access_key, secret=secret_key)
bucket_uri = 's3://{0}/{1}'
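
The example is cut off, but the usual pattern is to derive year/month/day/hour columns from the timestamp and pass them as partition_cols. A sketch of that idea with hypothetical column names, writing to a local directory for brevity instead of the S3 filesystem wrapper:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({'ts': pd.to_datetime(['2019-12-03 15:00', '2019-12-03 16:30']),
                   'value': [1, 2]})
df['year'] = df['ts'].dt.year
df['month'] = df['ts'].dt.month
df['day'] = df['ts'].dt.day
df['hour'] = df['ts'].dt.hour

table = pa.Table.from_pandas(df)
# Produces out/year=2019/month=12/day=3/hour=15/... and .../hour=16/...
pq.write_to_dataset(table, root_path='out',
                    partition_cols=['year', 'month', 'day', 'hour'])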

Read Parquet file stored in S3 with AWS Lambda (Python 3)

Submitted by 99封情书 on 2019-12-03 15:02:45
I am trying to load, process, and write Parquet files in S3 with AWS Lambda. My testing/deployment process is:

- https://github.com/lambci/docker-lambda as a container to mock the Amazon environment, because of the native libraries that need to be installed (numpy amongst others).
- This procedure to generate a zip file: http://docs.aws.amazon.com/lambda/latest/dg/with-s3-example-deployment-pkg.html#with-s3-example-deployment-pkg-python
- Add a test Python function to the zip, send it to S3, update the Lambda, and test it.

It seems that there are two possible approaches, which both work locally to
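
For context, a sketch of the kind of handler such a deployment would run, assuming hypothetical bucket/key names and that pyarrow, pandas, and s3fs are bundled into the package:

import pyarrow as pa
import pyarrow.parquet as pq
import s3fs

fs = s3fs.S3FileSystem()   # picks up the Lambda role's credentials

def handler(event, context):
    # Read the Parquet object from S3 into a pandas DataFrame.
    dataset = pq.ParquetDataset('my-bucket/input/data.parquet', filesystem=fs)
    df = dataset.read_pandas().to_pandas()

    processed = df.describe()   # placeholder for real processing

    # Write the result back to S3 as Parquet through the same filesystem.
    with fs.open('my-bucket/output/summary.parquet', 'wb') as f:
        pq.write_table(pa.Table.from_pandas(processed), f)
    return {'rows': len(df)}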