pyarrow

How to set/get Pandas dataframes into Redis using pyarrow

余生颓废 submitted on 2020-01-01 09:55:36
Question: Using

    dd = {'ID': ['H576', 'H577', 'H578', 'H600', 'H700'],
          'CD': ['AAAAAAA', 'BBBBB', 'CCCCCC', 'DDDDDD', 'EEEEEEE']}
    df = pd.DataFrame(dd)

Before pandas 0.25, the following worked.
set: redisConn.set("key", df.to_msgpack(compress='zlib'))
get: pd.read_msgpack(redisConn.get("key"))
Now there are deprecation warnings:

    FutureWarning: to_msgpack is deprecated and will be removed in a future version.
    It is recommended to use pyarrow for on-the-wire transmission of pandas objects.

The read_msgpack is …
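Since to_msgpack is going away, one replacement is to serialize the DataFrame with Arrow's IPC helpers and store the raw bytes in Redis. A minimal sketch, assuming a local Redis instance (connection details are placeholders) and pyarrow's serialize_pandas / deserialize_pandas helpers:

    import pandas as pd
    import pyarrow as pa
    import redis

    redis_conn = redis.Redis(host='localhost', port=6379)  # placeholder connection

    dd = {'ID': ['H576', 'H577', 'H578', 'H600', 'H700'],
          'CD': ['AAAAAAA', 'BBBBB', 'CCCCCC', 'DDDDDD', 'EEEEEEE']}
    df = pd.DataFrame(dd)

    # set: serialize the DataFrame to Arrow IPC bytes and store them under a key
    buf = pa.serialize_pandas(df)          # returns a pyarrow.Buffer
    redis_conn.set("key", buf.to_pybytes())

    # get: read the bytes back and rebuild the DataFrame
    df_back = pa.deserialize_pandas(redis_conn.get("key"))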

pandasUDF and pyarrow 0.15.0

只谈情不闲聊 submitted on 2019-12-23 08:02:46
Question: I have recently started getting a bunch of errors on a number of pyspark jobs running on EMR clusters. The errors are:

    java.lang.IllegalArgumentException
        at java.nio.ByteBuffer.allocate(ByteBuffer.java:334)
        at org.apache.arrow.vector.ipc.message.MessageSerializer.readMessage(MessageSerializer.java:543)
        at org.apache.arrow.vector.ipc.message.MessageChannelReader.readNext(MessageChannelReader.java:58)
        at org.apache.arrow.vector.ipc.ArrowStreamReader.readSchema(ArrowStreamReader.java:132)
        at org…
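This error pattern is commonly linked to the Arrow IPC format change in pyarrow 0.15.0, which the pandas_udf path in Spark 2.x does not understand. A hedged sketch of the usual workarounds, assuming Spark 2.4.x on YARN/EMR (the app name is a placeholder): either pin pyarrow below 0.15, or ask pyarrow to emit the legacy format via the ARROW_PRE_0_15_IPC_FORMAT environment variable on the driver and executors.

    import os
    from pyspark.sql import SparkSession

    # Ask pyarrow >= 0.15 to emit the legacy Arrow IPC format that Spark 2.4 expects
    os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"  # driver side

    spark = (SparkSession.builder
             .appName("pandas-udf-arrow-compat")  # placeholder name
             .config("spark.executorEnv.ARROW_PRE_0_15_IPC_FORMAT", "1")       # executors
             .config("spark.yarn.appMasterEnv.ARROW_PRE_0_15_IPC_FORMAT", "1")  # YARN AM
             .getOrCreate())

Alternatively, installing a matching library version (pip install 'pyarrow<0.15.0') on all nodes avoids the incompatibility entirely.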

How to specify logical types when writing Parquet files from PyArrow?

那年仲夏 submitted on 2019-12-22 10:53:06
Question: I'm using PyArrow to write Parquet files from some pandas DataFrames in Python. Is there a way to specify the logical types that are written to the Parquet file? For example, writing an np.uint32 column with PyArrow results in an INT64 column in the Parquet file, whereas writing the same column with the fastparquet module results in an INT32 column with a logical type of UINT_32 (this is the behaviour I'm after from PyArrow). E.g.:

    import pandas as pd
    import pyarrow as pa
    import pyarrow…
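A relevant detail here is that pyarrow's Parquet writer defaults to Parquet format version 1.0, which has no unsigned 32-bit logical type, so uint32 columns are widened to INT64. A sketch of the workaround, assuming a pyarrow release that accepts version='2.0' in write_table (file names are placeholders):

    import numpy as np
    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    df = pd.DataFrame({'a': np.array([1, 2, 3], dtype=np.uint32)})
    table = pa.Table.from_pandas(df)

    # Default (format version 1.0): the uint32 column is stored as plain INT64
    pq.write_table(table, 'example_v1.parquet')

    # Format version 2.0 keeps the unsigned logical type on disk
    pq.write_table(table, 'example_v2.parquet', version='2.0')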

UserWarning: pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream warnings

走远了吗. submitted on 2019-12-22 09:55:06
Question: I am running Spark 2.4.2 locally through PySpark for an ML project in NLP. Part of the pre-processing steps in the pipeline involve the use of pandas_udf functions optimised through pyarrow. Each time I operate on the pre-processed Spark DataFrame, the following warning appears:

    UserWarning: pyarrow.open_stream is deprecated, please use pyarrow.ipc.open_stream
    warnings.warn("pyarrow.open_stream is deprecated, please use "

I tried updating pyarrow but didn't manage to avoid the warning. My…
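The deprecated call lives inside pyspark's own serializers rather than in user code, so if the goal is only to silence the message, one option is to filter that specific UserWarning before running the pipeline. A minimal sketch, matching on the message text shown above:

    import warnings

    # Hide only the pyarrow.open_stream deprecation notice raised from pyspark internals
    warnings.filterwarnings(
        "ignore",
        message="pyarrow.open_stream is deprecated",
        category=UserWarning,
    )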

Read Parquet file stored in S3 with AWS Lambda (Python 3)

人盡茶涼 submitted on 2019-12-21 05:07:13
Question: I am trying to load, process and write Parquet files in S3 with AWS Lambda. My testing / deployment process is:
- https://github.com/lambci/docker-lambda as a container to mock the Amazon environment, because of the native libraries that need to be installed (numpy amongst others).
- This procedure to generate a zip file: http://docs.aws.amazon.com/lambda/latest/dg/with-s3-example-deployment-pkg.html#with-s3-example-deployment-pkg-python
- Add a test Python function to the zip, send it to S3, …
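Independent of the packaging question, the read/process/write part itself is commonly done with s3fs plus pyarrow.parquet. A minimal sketch of a handler, assuming the bucket and key names are placeholders and that pyarrow, s3fs and pandas are bundled into the deployment package:

    import pyarrow as pa
    import pyarrow.parquet as pq
    import s3fs

    fs = s3fs.S3FileSystem()

    def handler(event, context):
        # Read a Parquet file from S3 into a pandas DataFrame
        df = (pq.ParquetDataset('my-bucket/input/data.parquet', filesystem=fs)
                .read()
                .to_pandas())

        # ... transform df here ...

        # Write the result back to S3 as Parquet
        table = pa.Table.from_pandas(df)
        with fs.open('my-bucket/output/data.parquet', 'wb') as f:
            pq.write_table(table, f)

        return {'rows': len(df)}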

Pyarrow s3fs partition by timestamp

泪湿孤枕 submitted on 2019-12-21 05:00:45
Question: Is it possible to use a timestamp field in the pyarrow table to partition the s3fs file system by "YYYY/MM/DD/HH" while writing a Parquet file to S3?
Answer 1: I was able to achieve this with the pyarrow write_to_dataset function, which allows you to specify partition columns to create subdirectories. Example:

    import os
    import s3fs
    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq
    from pyarrow.filesystem import S3FSWrapper
    access_key = <access_key>
    secret_key = <secret_key>
    bucket…
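Filling out the idea in that answer, the YYYY/MM/DD/HH layout can be obtained by deriving the four parts from the timestamp column and passing them as partition_cols. A sketch, assuming credentials are picked up from the environment and the bucket name is a placeholder:

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq
    import s3fs

    fs = s3fs.S3FileSystem()  # assumes credentials from the environment

    df = pd.DataFrame({'ts': pd.to_datetime(['2019-12-21 05:00:45']), 'value': [1.0]})

    # Derive the partition columns from the timestamp field
    df['year'] = df['ts'].dt.year
    df['month'] = df['ts'].dt.month
    df['day'] = df['ts'].dt.day
    df['hour'] = df['ts'].dt.hour

    table = pa.Table.from_pandas(df)

    # Writes to my-bucket/dataset/year=YYYY/month=MM/day=DD/hour=HH/<file>.parquet
    pq.write_to_dataset(
        table,
        root_path='my-bucket/dataset',
        partition_cols=['year', 'month', 'day', 'hour'],
        filesystem=fs,
    )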

ModuleNotFoundError: No module named 'pyarrow' with satisfied requirements

旧巷老猫 submitted on 2019-12-13 03:42:30
Question: I am trying to run import pyarrow in a Jupyter Notebook and get this error: "ModuleNotFoundError: No module named 'pyarrow'". I have already installed it with both pip3 and brew, so when I run pip3 install pyarrow it says the requirements are already satisfied. All other libraries I have installed run with no issues from the same directory. Thank you.
Answer 1: This is an odd one, for sure. I am not familiar enough with pyarrow to know why the following worked. From the docs, if I do pip3…
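A frequent cause of this pattern is that pip3 and the Jupyter kernel point at different Python installations. A quick diagnostic to run inside the notebook (this is a general check, not the accepted fix from the truncated answer):

    import sys

    # Which interpreter is the notebook actually running?
    print(sys.executable)

    # Installing into that exact interpreter avoids a mismatched pip3 on PATH;
    # in a notebook cell this can be run as:
    # !{sys.executable} -m pip install pyarrow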

How to speed up creation of rolling sum (LTM) in pandas with large dataset?

爷,独闯天下 submitted on 2019-12-13 03:24:17
Question: I want to calculate the moving sum (rolling twelve months) of daily sales for a dataset with 400k rows and 7 columns. My current approach appears to work but is pretty slow (between 1-2 minutes). Columns include: date (daily entries), country, item name (product), customer city, customer number (ID) and customer name. As other datasets I work with are much larger (2+ million rows and more), it would be great if you have suggestions on how to speed up the current code:

    import pandas as pd
    import…
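One way to avoid slow row-wise loops for a trailing twelve-month sum is to use pandas' time-based rolling window per group. A sketch under the assumption that the frame has a date column, a grouping key such as the customer number, and a sales column (the column names below are placeholders):

    import numpy as np
    import pandas as pd

    # Toy stand-in for the real 400k-row frame; column names are assumptions
    rng = pd.date_range('2018-01-01', periods=730, freq='D')
    df = pd.DataFrame({
        'date': rng.repeat(2),
        'customer_id': ['A', 'B'] * len(rng),
        'sales': np.random.rand(2 * len(rng)),
    })

    # Trailing-365-day (LTM) sum of sales per customer, vectorised by pandas
    ltm = (
        df.sort_values('date')
          .set_index('date')
          .groupby('customer_id')['sales']
          .rolling('365D')
          .sum()
          .reset_index()
          .rename(columns={'sales': 'ltm_sales'})
    )
    print(ltm.tail())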

Error opening a parquet file on Amazon S3 using pyarrow

房东的猫 submitted on 2019-12-11 01:23:56
Question: I have this code, which is supposed to read a single column of data from a Parquet file stored on S3:

    fs = s3fs.S3FileSystem()
    data_set = pq.ParquetDataset(f"s3://{bucket}/{key}", filesystem=fs)
    column_data = data_set.read(columns=[col_name])

and I get this exception:

    validate_schemas
        self.schema = self.pieces[0].get_metadata(open_file).schema
    IndexError: list index out of range

I upgraded to the latest version of pyarrow but it did not help.
Source: https://stackoverflow.com/questions/52057964/error
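The traceback indicates that dataset discovery produced no pieces (self.pieces is empty), which usually means the given path did not resolve to any Parquet files. A diagnostic sketch with placeholder bucket/key values; note that in some older pyarrow versions the path also has to be passed without the s3:// scheme when an s3fs filesystem is supplied explicitly (this is an assumption about the version in use, not a confirmed fix):

    import s3fs
    import pyarrow.parquet as pq

    fs = s3fs.S3FileSystem()

    bucket = 'my-bucket'          # placeholder
    key = 'path/to/file.parquet'  # placeholder

    # First confirm the object is actually visible through s3fs
    print(fs.exists(f"{bucket}/{key}"))

    # Then try the dataset path without the s3:// scheme
    data_set = pq.ParquetDataset(f"{bucket}/{key}", filesystem=fs)
    column_data = data_set.read(columns=['col_name'])  # 'col_name' is a placeholder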