pyarrow

How to set/get Pandas dataframes into Redis using pyarrow

余生颓废 submitted on 2020-01-01 09:55:36
Question: Using

    dd = {'ID': ['H576', 'H577', 'H578', 'H600', 'H700'],
          'CD': ['AAAAAAA', 'BBBBB', 'CCCCCC', 'DDDDDD', 'EEEEEEE']}
    df = pd.DataFrame(dd)

Before pandas 0.25, the following worked.
set: redisConn.set("key", df.to_msgpack(compress='zlib'))
get: pd.read_msgpack(redisConn.get("key"))
Now there are deprecation warnings:

    FutureWarning: to_msgpack is deprecated and will be removed in a future version.
    It is recommended to use pyarrow for on-the-wire transmission of pandas objects.

The read_msgpack is …
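Since to_msgpack is going away, one replacement is to serialize the DataFrame with Arrow's IPC helpers and store the raw bytes in Redis. A minimal sketch, assuming a local Redis instance (connection details are placeholders) and pyarrow's serialize_pandas / deserialize_pandas helpers:

    import pandas as pd
    import pyarrow as pa
    import redis

    redis_conn = redis.Redis(host='localhost', port=6379)  # placeholder connection

    dd = {'ID': ['H576', 'H577', 'H578', 'H600', 'H700'],
          'CD': ['AAAAAAA', 'BBBBB', 'CCCCCC', 'DDDDDD', 'EEEEEEE']}
    df = pd.DataFrame(dd)

    # set: serialize the DataFrame to Arrow IPC bytes and store them under a key
    buf = pa.serialize_pandas(df)          # returns a pyarrow.Buffer
    redis_conn.set("key", buf.to_pybytes())

    # get: read the bytes back and rebuild the DataFrame
    df_back = pa.deserialize_pandas(redis_conn.get("key"))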

pandasUDF and pyarrow 0.15.0

只谈情不闲聊 submitted on 2019-12-23 08:02:46
Question: I have recently started getting a bunch of errors on a number of pyspark jobs running on EMR clusters. The errors are:

    java.lang.IllegalArgumentException
        at java.nio.ByteBuffer.allocate(ByteBuffer.java:334)
        at org.apache.arrow.vector.ipc.message.MessageSerializer.readMessage(MessageSerializer.java:543)
        at org.apache.arrow.vector.ipc.message.MessageChannelReader.readNext(MessageChannelReader.java:58)
        at org.apache.arrow.vector.ipc.ArrowStreamReader.readSchema(ArrowStreamReader.java:132)
        at org…
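This error pattern is commonly linked to the Arrow IPC format change in pyarrow 0.15.0, which the pandas_udf path in Spark 2.x does not understand. A hedged sketch of the usual workarounds, assuming Spark 2.4.x on YARN/EMR (the app name is a placeholder): either pin pyarrow below 0.15, or ask pyarrow to emit the legacy format via the ARROW_PRE_0_15_IPC_FORMAT environment variable on the driver and executors.

    import os
    from pyspark.sql import SparkSession

    # Ask pyarrow >= 0.15 to emit the legacy Arrow IPC format that Spark 2.4 expects
    os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"  # driver side

    spark = (SparkSession.builder
             .appName("pandas-udf-arrow-compat")  # placeholder name
             .config("spark.executorEnv.ARROW_PRE_0_15_IPC_FORMAT", "1")       # executors
             .config("spark.yarn.appMasterEnv.ARROW_PRE_0_15_IPC_FORMAT", "1")  # YARN AM
             .getOrCreate())

Alternatively, installing a matching library version (pip install 'pyarrow<0.15.0') on all nodes avoids the incompatibility entirely.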

How to specify logical types when writing Parquet files from PyArrow?

那年仲夏 submitted on 2019-12-22 10:53:06
Question: I'm using PyArrow to write Parquet files from some pandas DataFrames in Python. Is there a way to specify the logical types that are written to the Parquet file? For example, writing an np.uint32 column with PyArrow results in an INT64 column in the Parquet file, whereas writing the same column with the fastparquet module results in an INT32 column with a logical type of UINT_32 (this is the behaviour I'm after from PyArrow). E.g.:

    import pandas as pd
    import pyarrow as pa
    import pyarrow…
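A relevant detail here is that pyarrow's Parquet writer defaults to Parquet format version 1.0, which has no unsigned 32-bit logical type, so uint32 columns are widened to INT64. A sketch of the workaround, assuming a pyarrow release that accepts version='2.0' in write_table (file names are placeholders):

    import numpy as np
    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    df = pd.DataFrame({'a': np.array([1, 2, 3], dtype=np.uint32)})
    table = pa.Table.from_pandas(df)

    # Default (format version 1.0): the uint32 column is stored as plain INT64
    pq.write_table(table, 'example_v1.parquet')

    # Format version 2.0 keeps the unsigned logical type on disk
    pq.write_table(table, 'example_v2.parquet', version='2.0')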

UserWarning: pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream warnings

走远了吗. submitted on 2019-12-22 09:55:06
Question: I am running Spark 2.4.2 locally through PySpark for an ML project in NLP. Part of the pre-processing steps in the pipeline involve the use of pandas_udf functions optimised through pyarrow. Each time I operate on the pre-processed Spark DataFrame, the following warning appears:

    UserWarning: pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream
    warnings.warn("pyarrow.open_stream is deprecated, please use "

I tried updating pyarrow but didn't manage to avoid the warning. My…
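The deprecated call lives inside pyspark's own serializers rather than in user code, so if the goal is only to silence the message, one option is to filter that specific UserWarning before running the pipeline. A minimal sketch, matching on the message text shown above:

    import warnings

    # Hide only the pyarrow.open_stream deprecation notice raised from pyspark internals
    warnings.filterwarnings(
        "ignore",
        message="pyarrow.open_stream is deprecated",
        category=UserWarning,
    )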

Read Parquet file stored in S3 with AWS Lambda (Python 3)

人盡茶涼 submitted on 2019-12-21 05:07:13
Question: I am trying to load, process and write Parquet files in S3 with AWS Lambda. My testing / deployment process is:
- https://github.com/lambci/docker-lambda as a container to mock the Amazon environment, because of the native libraries that need to be installed (numpy amongst others).
- This procedure to generate a zip file: http://docs.aws.amazon.com/lambda/latest/dg/with-s3-example-deployment-pkg.html#with-s3-example-deployment-pkg-python
- Add a test Python function to the zip, send it to S3, …
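Independent of the packaging question, the read/process/write part itself is commonly done with s3fs plus pyarrow.parquet. A minimal sketch of a handler, assuming the bucket and key names are placeholders and that pyarrow, s3fs and pandas are bundled into the deployment package:

    import pyarrow as pa
    import pyarrow.parquet as pq
    import s3fs

    fs = s3fs.S3FileSystem()

    def handler(event, context):
        # Read a Parquet file from S3 into a pandas DataFrame
        df = (pq.ParquetDataset('my-bucket/input/data.parquet', filesystem=fs)
                .read()
                .to_pandas())

        # ... transform df here ...

        # Write the result back to S3 as Parquet
        table = pa.Table.from_pandas(df)
        with fs.open('my-bucket/output/data.parquet', 'wb') as f:
            pq.write_table(table, f)

        return {'rows': len(df)}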

Pyarrow s3fs partition by timestamp

泪湿孤枕 submitted on 2019-12-21 05:00:45
Question: Is it possible to use a timestamp field in the pyarrow table to partition the s3fs file system by "YYYY/MM/DD/HH" while writing a Parquet file to S3?
Answer 1: I was able to achieve this with the pyarrow write_to_dataset function, which allows you to specify partition columns to create subdirectories. Example:

    import os
    import s3fs
    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq
    from pyarrow.filesystem import S3FSWrapper
    access_key = <access_key>
    secret_key = <secret_key>
    bucket…
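Filling out the idea in that answer, the YYYY/MM/DD/HH layout can be obtained by deriving the four parts from the timestamp column and passing them as partition_cols. A sketch, assuming credentials are picked up from the environment and the bucket name is a placeholder:

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq
    import s3fs

    fs = s3fs.S3FileSystem()  # assumes credentials from the environment

    df = pd.DataFrame({'ts': pd.to_datetime(['2019-12-21 05:00:45']), 'value': [1.0]})

    # Derive the partition columns from the timestamp field
    df['year'] = df['ts'].dt.year
    df['month'] = df['ts'].dt.month
    df['day'] = df['ts'].dt.day
    df['hour'] = df['ts'].dt.hour

    table = pa.Table.from_pandas(df)

    # Writes to my-bucket/dataset/year=YYYY/month=MM/day=DD/hour=HH/<file>.parquet
    pq.write_to_dataset(
        table,
        root_path='my-bucket/dataset',
        partition_cols=['year', 'month', 'day', 'hour'],
        filesystem=fs,
    )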

ModuleNotFoundError: No module named 'pyarrow' with satisfied requirements

旧巷老猫 submitted on 2019-12-13 03:42:30
Question: I am trying to run import pyarrow in a Jupyter Notebook and get this error: "ModuleNotFoundError: No module named 'pyarrow'". I have already installed it with both pip3 and brew, so when I run pip3 install pyarrow it says the requirements are already satisfied. All other libraries I have installed run with no issues from the same directory. Thank you.
Answer 1: This is an odd one, for sure. I am not familiar enough with pyarrow to know why the following worked. From the docs, if I do pip3…
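A frequent cause of this pattern is that pip3 and the Jupyter kernel point at different Python installations. A quick diagnostic to run inside the notebook (this is a general check, not the accepted fix from the truncated answer):

    import sys

    # Which interpreter is the notebook actually running?
    print(sys.executable)

    # Installing into that exact interpreter avoids a mismatched pip3 on PATH;
    # in a notebook cell this can be run as:
    # !{sys.executable} -m pip install pyarrow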

How to speed up creation of rolling sum (LTM) in pandas with large dataset?

爷,独闯天下 submitted on 2019-12-13 03:24:17
Question: I want to calculate the moving sum (rolling twelve months) of daily sales for a dataset with 400k rows and 7 columns. My current approach appears to work but is pretty slow (between 1-2 minutes). Columns include: date (daily entries), country, item name (product), customer city, customer number (ID) and customer name. As other datasets I work with are much larger (2+ million rows and more), it would be great if you have suggestions on how to speed up the current code:

    import pandas as pd
    import…
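One way to avoid slow row-wise loops for a trailing twelve-month sum is to use pandas' time-based rolling window per group. A sketch under the assumption that the frame has a date column, a grouping key such as the customer number, and a sales column (the column names below are placeholders):

    import numpy as np
    import pandas as pd

    # Toy stand-in for the real 400k-row frame; column names are assumptions
    rng = pd.date_range('2018-01-01', periods=730, freq='D')
    df = pd.DataFrame({
        'date': rng.repeat(2),
        'customer_id': ['A', 'B'] * len(rng),
        'sales': np.random.rand(2 * len(rng)),
    })

    # Trailing-365-day (LTM) sum of sales per customer, vectorised by pandas
    ltm = (
        df.sort_values('date')
          .set_index('date')
          .groupby('customer_id')['sales']
          .rolling('365D')
          .sum()
          .reset_index()
          .rename(columns={'sales': 'ltm_sales'})
    )
    print(ltm.tail())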

Error opening a parquet file on Amazon S3 using pyarrow

房东的猫 submitted on 2019-12-11 01:23:56
Question: I have this code, which is supposed to read a single column of data from a Parquet file stored on S3:

    fs = s3fs.S3FileSystem()
    data_set = pq.ParquetDataset(f"s3://{bucket}/{key}", filesystem=fs)
    column_data = data_set.read(columns=[col_name])

and I get this exception:

    validate_schemas
        self.schema = self.pieces[0].get_metadata(open_file).schema
    IndexError: list index out of range

I upgraded to the latest version of pyarrow but it did not help.
Source: https://stackoverflow.com/questions/52057964/error
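The traceback indicates that dataset discovery produced no pieces (self.pieces is empty), which usually means the given path did not resolve to any Parquet files. A diagnostic sketch with placeholder bucket/key values; note that in some older pyarrow versions the path also has to be passed without the s3:// scheme when an s3fs filesystem is supplied explicitly (this is an assumption about the version in use, not a confirmed fix):

    import s3fs
    import pyarrow.parquet as pq

    fs = s3fs.S3FileSystem()

    bucket = 'my-bucket'          # placeholder
    key = 'path/to/file.parquet'  # placeholder

    # First confirm the object is actually visible through s3fs
    print(fs.exists(f"{bucket}/{key}"))

    # Then try the dataset path without the s3:// scheme
    data_set = pq.ParquetDataset(f"{bucket}/{key}", filesystem=fs)
    column_data = data_set.read(columns=['col_name'])  # 'col_name' is a placeholder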