pyarrow

Repartitioning parquet-mr generated parquets with pyarrow/parquet-cpp increases file size by x30?

旧时模样 submitted on 2020-04-30 16:37:15
Question: Using AWS Firehose I am converting incoming records to Parquet. In one example, 150k identical records enter Firehose and a single 30 KB Parquet file gets written to S3. Because of how Firehose partitions data, we have a secondary process (a Lambda triggered by the S3 put event) that reads in the Parquet file and repartitions it based on the date within the event itself. After this repartitioning process, the 30 KB file jumps to 900 KB. Inspecting both Parquet files: the metadata doesn't change, the data …
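
Not part of the question, but a minimal sketch of the read-and-rewrite step it describes, assuming local file paths instead of S3; the compression and dictionary settings are illustrative assumptions, since writer options like these are what typically drive such size differences:

```python
import pyarrow.parquet as pq

# Hypothetical paths; the poster's Lambda reads from and writes to S3.
src = "firehose_output.parquet"
dst = "repartitioned.parquet"

# Read the parquet-mr (Firehose) generated file into an Arrow table.
table = pq.read_table(src)

# Rewrite it with parquet-cpp. Compression and dictionary encoding
# strongly influence output size: 150k identical records stay tiny
# only when repeated values are dictionary-encoded and compressed.
pq.write_table(
    table,
    dst,
    compression="snappy",   # assumption: match Firehose's SNAPPY output
    use_dictionary=True,    # dictionary-encode repeated values
)
```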

Secondary in-memory index representations in Python

老子叫甜甜 submitted on 2020-03-18 09:46:57
Question: I am searching for an efficient solution to build a secondary in-memory index in Python using a high-level, optimised mathematical package such as NumPy or Arrow. I am excluding pandas for performance reasons. Definition: "A secondary index contains an entry for each existing value of the attribute to be indexed. This entry can be seen as a key/value pair with the attribute value as key and, as value, a list of pointers to all records in the base table that have this value." - J.V. D'Silva et al.
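
Not part of the question, but a minimal sketch of the definition above, assuming a made-up table and column: the index maps each distinct attribute value to a NumPy array of row positions in the base table.

```python
import numpy as np
import pyarrow as pa

# Hypothetical base table with an attribute "city" to be indexed.
table = pa.table({
    "city": ["Berlin", "Paris", "Berlin", "Lima", "Paris", "Berlin"],
    "amount": [10, 20, 30, 40, 50, 60],
})

# Secondary index: attribute value -> array of row positions ("pointers").
values = table["city"].combine_chunks().to_numpy(zero_copy_only=False)
order = np.argsort(values, kind="stable")
uniques, starts = np.unique(values[order], return_index=True)
ends = list(starts[1:]) + [len(values)]
index = {v: order[s:e] for v, s, e in zip(uniques, starts, ends)}

# Point lookup: all rows whose city is "Berlin".
print(table.take(pa.array(index["Berlin"])))
```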

“Raise RuntimeError('Not supported on 32-bit Windows')” when installing pyarrow

|▌冷眼眸甩不掉的悲伤 submitted on 2020-03-02 06:54:07
Question: I get this error whenever I try to install pyarrow on my PC. The machine is 64-bit, so I don't understand it:

raise RuntimeError('Not supported on 32-bit Windows')
RuntimeError: Not supported on 32-bit Windows
ERROR: Failed building wheel for pyarrow
ERROR: Could not build wheels for pyarrow which use PEP 517 and cannot be installed directly

I have pip updated and have installed many other packages without problems.

Answer 1: The reason PyArrow is trying to build a 32…
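
The answer excerpt is cut off, but the error usually means the Python interpreter itself is 32-bit even though Windows is 64-bit, so pip finds no matching pyarrow wheel and falls back to building from source. A small sketch to check the interpreter's architecture:

```python
import platform
import struct
import sys

# Bitness of the running Python interpreter (not of Windows itself).
print(sys.version)
print(platform.architecture())    # e.g. ('32bit', 'WindowsPE')
print(struct.calcsize("P") * 8)   # pointer size in bits: 32 or 64
```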

How to read parquet file with a condition using pyarrow in Python

时间秒杀一切 submitted on 2020-02-26 10:04:46
Question: I have created a Parquet file with three columns (id, author, title) from a database and want to read the Parquet file with a condition (title='Learn Python'). Below is the Python code I am using for this POC:

import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
import pyodbc

def write_to_parquet(df, out_path, compression='SNAPPY'):
    arrow_table = pa.Table.from_pandas(df)
    if compression == 'UNCOMPRESSED':
        compression = None
    pq.write_table(arrow_table, out_path, …
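
Not part of the question, but a minimal sketch of the read side with a condition, assuming a file name; pyarrow's filters argument pushes the predicate down when reading (granularity depends on pyarrow version and row-group layout):

```python
import pyarrow.parquet as pq

# Hypothetical file produced by write_to_parquet above.
path = "books.parquet"

# Read only rows matching the condition title = 'Learn Python'.
table = pq.read_table(path, filters=[("title", "=", "Learn Python")])
print(table.to_pandas())
```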

Are parquet files created with pyarrow vs pyspark compatible?

…衆ロ難τιáo~ submitted on 2020-02-25 06:03:40
Question: I have to convert analytics data from JSON to Parquet in two steps. For the large amount of existing data I am writing a PySpark job and doing df.repartition(*partitionby).write.partitionBy(partitionby).mode("append").parquet(output, compression=codec); however, for incremental data I plan to use AWS Lambda. PySpark would probably be overkill for that, so I plan to use PyArrow instead (I am aware that it unnecessarily involves pandas, but I couldn't find a better alternative). So, …
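
Not from the question, but a minimal sketch of what the PyArrow/Lambda side could look like, with made-up record and column names: it writes a Hive-style partitioned layout that Spark can read alongside the PySpark-written data.

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical incremental batch of JSON-like records (one Lambda invocation).
records = [
    {"event_date": "2020-02-25", "user": "a", "value": 1},
    {"event_date": "2020-02-25", "user": "b", "value": 2},
]
df = pd.DataFrame(records)

# Convert via pandas (as the question notes, hard to avoid with pyarrow)
# and append to a partitioned dataset, partitioned like the Spark job.
table = pa.Table.from_pandas(df, preserve_index=False)
pq.write_to_dataset(table, root_path="output/", partition_cols=["event_date"])
```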

pyarrow.lib.ArrowTypeError: an integer is required (got type str)

依然范特西╮ submitted on 2020-01-24 16:50:35
Question: I want to ingest the new rows from my SQL Server table. The way I found to get the differential is the script below. For MySQL tables it works perfectly, but when I added the pymssql library to connect to this new database and apply the differential file ingestion, I ran into the error below. I would like help understanding why I can't apply the script to tables that are on SQL Server!

import os
import pandas as pd
import numpy as np
import mysql.connector as sql
from datetime import datetime, …
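
Not from the question's actual data, but a minimal sketch of the kind of mixed-type column that typically triggers this ArrowTypeError, together with an explicit dtype coercion as one possible workaround; column names and values are made up.

```python
import pandas as pd
import pyarrow as pa

# Object-dtype column mixing ints and strings, as a driver such as pymssql
# can hand back: Arrow cannot settle on a single type for the column.
df = pd.DataFrame({"id": [1, 2, "abc"]})

try:
    pa.Table.from_pandas(df)
except (pa.lib.ArrowTypeError, pa.lib.ArrowInvalid) as exc:
    # Exact class and message vary by pyarrow version, e.g.
    # "an integer is required (got type str)".
    print(exc)

# Workaround: force each column to one explicit type before converting.
df["id"] = df["id"].astype(str)
table = pa.Table.from_pandas(df, preserve_index=False)
print(table.schema)
```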
