pyarrow

Repartitioning parquet-mr generated parquets with pyarrow/parquet-cpp increases file size by x30?

旧时模样 submitted on 2020-04-30 16:37:15
Question: Using AWS Firehose I am converting incoming records to Parquet. In one example, 150k identical records enter Firehose and a single 30 KB Parquet file gets written to S3. Because of how Firehose partitions data, we have a secondary process (a Lambda triggered by the S3 put event) that reads in the Parquet file and repartitions it based on the date within the event itself. After this repartitioning process, the 30 KB file jumps to 900 KB. Inspecting both Parquet files: the metadata doesn't change, the data …
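
Not part of the question, but a minimal sketch of the read-and-rewrite step it describes, assuming local file paths instead of S3; the compression and dictionary settings are illustrative assumptions, since writer options like these are what typically drive such size differences:

```python
import pyarrow.parquet as pq

# Hypothetical paths; the poster's Lambda reads from and writes to S3.
src = "firehose_output.parquet"
dst = "repartitioned.parquet"

# Read the parquet-mr (Firehose) generated file into an Arrow table.
table = pq.read_table(src)

# Rewrite it with parquet-cpp. Compression and dictionary encoding
# strongly influence output size: 150k identical records stay tiny
# only when repeated values are dictionary-encoded and compressed.
pq.write_table(
    table,
    dst,
    compression="snappy",   # assumption: match Firehose's SNAPPY output
    use_dictionary=True,    # dictionary-encode repeated values
)
```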

Secondary in-memory index representations in Python

老子叫甜甜 submitted on 2020-03-18 09:46:57
Question: I am searching for an efficient solution to build a secondary in-memory index in Python using a high-level, optimised mathematical package such as NumPy or Arrow. I am excluding pandas for performance reasons. Definition: "A secondary index contains an entry for each existing value of the attribute to be indexed. This entry can be seen as a key/value pair with the attribute value as key and, as value, a list of pointers to all records in the base table that have this value." - J.V. D'Silva et al.
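
Not part of the question, but a minimal sketch of the definition above, assuming a made-up table and column: the index maps each distinct attribute value to a NumPy array of row positions in the base table.

```python
import numpy as np
import pyarrow as pa

# Hypothetical base table with an attribute "city" to be indexed.
table = pa.table({
    "city": ["Berlin", "Paris", "Berlin", "Lima", "Paris", "Berlin"],
    "amount": [10, 20, 30, 40, 50, 60],
})

# Secondary index: attribute value -> array of row positions ("pointers").
values = table["city"].combine_chunks().to_numpy(zero_copy_only=False)
order = np.argsort(values, kind="stable")
uniques, starts = np.unique(values[order], return_index=True)
ends = list(starts[1:]) + [len(values)]
index = {v: order[s:e] for v, s, e in zip(uniques, starts, ends)}

# Point lookup: all rows whose city is "Berlin".
print(table.take(pa.array(index["Berlin"])))
```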

“Raise RuntimeError('Not supported on 32-bit Windows')” when installing pyarrow

|▌冷眼眸甩不掉的悲伤 submitted on 2020-03-02 06:54:07
Question: I get this error whenever I try to install pyarrow on my PC. The machine is 64-bit, so I don't understand it:

raise RuntimeError('Not supported on 32-bit Windows')
RuntimeError: Not supported on 32-bit Windows
ERROR: Failed building wheel for pyarrow
ERROR: Could not build wheels for pyarrow which use PEP 517 and cannot be installed directly

I have pip updated and have installed many other packages without problems.

Answer 1: The reason PyArrow is trying to build a 32…
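
The answer excerpt is cut off, but the error usually means the Python interpreter itself is 32-bit even though Windows is 64-bit, so pip finds no matching pyarrow wheel and falls back to building from source. A small sketch to check the interpreter's architecture:

```python
import platform
import struct
import sys

# Bitness of the running Python interpreter (not of Windows itself).
print(sys.version)
print(platform.architecture())    # e.g. ('32bit', 'WindowsPE')
print(struct.calcsize("P") * 8)   # pointer size in bits: 32 or 64
```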

How to read parquet file with a condition using pyarrow in Python

时间秒杀一切 submitted on 2020-02-26 10:04:46
Question: I have created a Parquet file with three columns (id, author, title) from a database and want to read the Parquet file with a condition (title='Learn Python'). Below is the Python code I am using for this POC:

import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
import pyodbc

def write_to_parquet(df, out_path, compression='SNAPPY'):
    arrow_table = pa.Table.from_pandas(df)
    if compression == 'UNCOMPRESSED':
        compression = None
    pq.write_table(arrow_table, out_path, …
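
Not part of the question, but a minimal sketch of the read side with a condition, assuming a file name; pyarrow's filters argument pushes the predicate down when reading (granularity depends on pyarrow version and row-group layout):

```python
import pyarrow.parquet as pq

# Hypothetical file produced by write_to_parquet above.
path = "books.parquet"

# Read only rows matching the condition title = 'Learn Python'.
table = pq.read_table(path, filters=[("title", "=", "Learn Python")])
print(table.to_pandas())
```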

Are parquet files created with pyarrow vs pyspark compatible?

…衆ロ難τιáo~ submitted on 2020-02-25 06:03:40
Question: I have to convert analytics data from JSON to Parquet in two steps. For the large amount of existing data I am writing a PySpark job and doing df.repartition(*partitionby).write.partitionBy(partitionby).mode("append").parquet(output, compression=codec); however, for incremental data I plan to use AWS Lambda. PySpark would probably be overkill for that, so I plan to use PyArrow instead (I am aware that it unnecessarily involves pandas, but I couldn't find a better alternative). So, …
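
Not from the question, but a minimal sketch of what the PyArrow/Lambda side could look like, with made-up record and column names: it writes a Hive-style partitioned layout that Spark can read alongside the PySpark-written data.

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical incremental batch of JSON-like records (one Lambda invocation).
records = [
    {"event_date": "2020-02-25", "user": "a", "value": 1},
    {"event_date": "2020-02-25", "user": "b", "value": 2},
]
df = pd.DataFrame(records)

# Convert via pandas (as the question notes, hard to avoid with pyarrow)
# and append to a partitioned dataset, partitioned like the Spark job.
table = pa.Table.from_pandas(df, preserve_index=False)
pq.write_to_dataset(table, root_path="output/", partition_cols=["event_date"])
```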

pyarrow.lib.ArrowTypeError: an integer is required (got type str)

依然范特西╮ submitted on 2020-01-24 16:50:35
Question: I want to ingest the new rows from my SQL Server table. The way I found to get the differential is the script below. For MySQL tables it works perfectly, but when I added the pymssql library to connect to this new database and apply the differential file ingestion, I ran into the error below. I would like help understanding why I can't apply the script to tables that are on SQL Server!

import os
import pandas as pd
import numpy as np
import mysql.connector as sql
from datetime import datetime, …
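
Not from the question's actual data, but a minimal sketch of the kind of mixed-type column that typically triggers this ArrowTypeError, together with an explicit dtype coercion as one possible workaround; column names and values are made up.

```python
import pandas as pd
import pyarrow as pa

# Object-dtype column mixing ints and strings, as a driver such as pymssql
# can hand back: Arrow cannot settle on a single type for the column.
df = pd.DataFrame({"id": [1, 2, "abc"]})

try:
    pa.Table.from_pandas(df)
except (pa.lib.ArrowTypeError, pa.lib.ArrowInvalid) as exc:
    # Exact class and message vary by pyarrow version, e.g.
    # "an integer is required (got type str)".
    print(exc)

# Workaround: force each column to one explicit type before converting.
df["id"] = df["id"].astype(str)
table = pa.Table.from_pandas(df, preserve_index=False)
print(table.schema)
```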
