pyarrow

Using pyarrow, how do you append to a Parquet file?

孤人 submitted on 2019-12-03 05:51:16
Question: How do you append/update to a Parquet file with pyarrow?

    import numpy as np
    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    table2 = pd.DataFrame({'one': [-1, np.nan, 2.5], 'two': ['foo', 'bar', 'baz'], 'three': [True, False, True]})
    table3 = pd.DataFrame({'six': [-1, np.nan, 2.5], 'nine': ['foo', 'bar', 'baz'], 'ten': [True, False, True]})

    pq.write_table(pa.Table.from_pandas(table2), './dataNew/pqTest2.parquet')
    # append pqTest2 here?

There is nothing I found in the docs about appending to Parquet files. …
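The excerpt is cut off above. A Parquet file cannot be appended to once its footer has been written; a commonly suggested workaround is to keep a pyarrow.parquet.ParquetWriter open and write several tables into the same file as separate row groups. A minimal sketch reusing the question's frame, assuming every chunk shares the same schema:

    import numpy as np
    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    chunk = pd.DataFrame({'one': [-1, np.nan, 2.5],
                          'two': ['foo', 'bar', 'baz'],
                          'three': [True, False, True]})
    table = pa.Table.from_pandas(chunk)

    # Each write_table() call adds a new row group to the open file;
    # all chunks must use the same schema, and nothing more can be
    # added once the writer is closed.
    writer = pq.ParquetWriter('pqTest2.parquet', table.schema)
    writer.write_table(table)
    writer.write_table(table)  # "append" another chunk before closing
    writer.close()

For data that keeps growing after the file is closed, the other pattern that gets recommended is to write one file per batch into a common directory and read them back together with pq.ParquetDataset.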

How to read a list of parquet files from S3 as a pandas dataframe using pyarrow?

Deadly submitted on 2019-12-03 01:17:47
Question: I have a hacky way of achieving this using boto3 (1.4.4), pyarrow (0.4.1) and pandas (0.20.3). First, I can read a single Parquet file locally like this:

    import pyarrow.parquet as pq

    path = 'parquet/part-r-00000-1e638be4-e31f-498a-a359-47d017a0059c.gz.parquet'
    table = pq.read_table(path)
    df = table.to_pandas()

I can also read a directory of Parquet files locally like this:

    import pyarrow.parquet as pq

    dataset = pq.ParquetDataset('parquet/')
    table = dataset.read()
    df = table.to_pandas()

Both …
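The excerpt ends mid-sentence. One widely used pattern for the S3 case is to hand pyarrow an s3fs filesystem object, so the directory-reading code stays the same as the local version; the bucket and prefix below are placeholders:

    import s3fs
    import pyarrow.parquet as pq

    # s3fs exposes S3 as a filesystem object that pyarrow can read through.
    fs = s3fs.S3FileSystem()

    # Placeholder prefix; point it at the directory containing the part files.
    dataset = pq.ParquetDataset('my-bucket/path/to/parquet', filesystem=fs)
    df = dataset.read().to_pandas()

Only the filesystem changes between the local and the S3 case; the ParquetDataset and to_pandas() steps are identical.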

How to write Parquet metadata with pyarrow?

回眸只為那壹抹淺笑 submitted on 2019-12-01 16:40:28
I use pyarrow to create and analyse Parquet tables with biological information, and I need to store some metadata, e.g. which sample the data comes from and how it was obtained and processed. Parquet seems to support file-wide metadata, but I cannot find how to write it via pyarrow. The closest thing I could find is how to write row-group metadata, but this seems like overkill, since my metadata is the same for all row groups in the file. Is there any way to write file-wide Parquet metadata with pyarrow?

Source: https://stackoverflow.com/questions/52122674/how-to-write-parquet-metadata-with
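A sketch of the approach that usually answers this: attach key/value metadata to the Arrow schema before writing, and pyarrow stores it in the Parquet file footer, i.e. once per file rather than per row group. The sample keys and values here are invented for illustration:

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    df = pd.DataFrame({'sample': ['a', 'b'], 'value': [1, 2]})
    table = pa.Table.from_pandas(df)

    # Schema-level key/value metadata; keys and values must be bytes.
    # Merge with the existing metadata so the pandas schema entry survives.
    custom = {b'sample_id': b'S123', b'processing': b'trimmed, aligned'}
    merged = {**(table.schema.metadata or {}), **custom}
    table = table.replace_schema_metadata(merged)

    pq.write_table(table, 'biodata.parquet')

    # The metadata comes back with the file-wide schema.
    print(pq.read_schema('biodata.parquet').metadata)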

AWS EMR - ModuleNotFoundError: No module named 'pyarrow'

若如初见. submitted on 2019-11-28 13:10:54
Question: I am running into this problem with the Apache Arrow / Spark integration, using AWS EMR with Spark 2.4.3. I tested this on both a local single-machine Spark instance and a Cloudera cluster, and everything works fine there. I set these in spark-env.sh:

    export PYSPARK_PYTHON=python3
    export PYSPARK_PYTHON_DRIVER=python3

and confirmed this in the Spark shell:

    spark.version    # 2.4.3
    sc.pythonExec    # python3
    sc.pythonVer     # python3

Running a basic pandas_udf with the Apache Arrow integration results in an error: from pyspark.sql.functions …
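The excerpt breaks off at the import. For context, a pandas_udf along the lines of the sketch below (an assumption, since the question's own UDF is cut off) is what exercises the Arrow integration and fails with ModuleNotFoundError when pyarrow is importable on the driver but not on the executors; on EMR the usual remedy is to install pyarrow on every node, for example through a bootstrap action, rather than only on the master.

    # Minimal pandas_udf; it needs pyarrow importable on every executor.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    spark = SparkSession.builder.appName("arrow-check").getOrCreate()
    spark.conf.set("spark.sql.execution.arrow.enabled", "true")

    @pandas_udf("long", PandasUDFType.SCALAR)
    def plus_one(v):
        # v arrives as a pandas Series thanks to the Arrow transfer.
        return v + 1

    spark.range(10).withColumn("x", plus_one("id")).show()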

What are the differences between feather and parquet?

喜夏-厌秋 submitted on 2019-11-28 04:29:59
Both are columnar (disk-)storage formats for use in data analysis systems. Both are integrated within Apache Arrow (the pyarrow package for Python) and are designed to correspond with Arrow as a columnar in-memory analytics layer. How do the two formats differ? Should you always prefer Feather when working with pandas, where possible? What are the use cases where Feather is more suitable than Parquet, and the other way round?

Appendix: I found some hints here https://github.com/wesm/feather/issues/188, but given the young age of this project, it is possibly a bit out of date. Not a serious speed test …
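To make the comparison concrete, a small sketch that writes the same frame in both formats; roughly, Feather is a thin on-disk form of Arrow memory aimed at fast local interchange, while Parquet applies heavier encoding and compression and targets long-term storage and the wider big-data ecosystem:

    import pandas as pd
    import pyarrow as pa
    import pyarrow.feather as feather
    import pyarrow.parquet as pq

    df = pd.DataFrame({'a': range(1000), 'b': ['x'] * 1000})

    # Feather: essentially Arrow record batches written straight to disk;
    # very fast to read and write, little per-file overhead.
    feather.write_feather(df, 'example.feather')

    # Parquet: columnar file format with dictionary encoding and compression;
    # slower to write but usually smaller, and readable by Spark, Hive, etc.
    pq.write_table(pa.Table.from_pandas(df), 'example.parquet', compression='snappy')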

How to save a huge pandas dataframe to hdfs?

前提是你 submitted on 2019-11-27 22:34:52
I'm working with pandas and with Spark dataframes. The dataframes are always very big (> 20 GB) and the standard Spark functions are not sufficient for those sizes. Currently I'm converting my pandas dataframe to a Spark dataframe like this:

    dataframe = spark.createDataFrame(pandas_dataframe)

I do that transformation because with Spark, writing dataframes to HDFS is very easy:

    dataframe.write.parquet(output_uri, mode="overwrite", compression="snappy")

But the transformation is failing for dataframes which are bigger than 2 GB. If I transform a Spark dataframe to pandas I can use pyarrow: …
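One way to avoid the pandas-to-Spark conversion entirely is to write the Parquet file to HDFS with pyarrow itself. A sketch using pyarrow's legacy HDFS client, assuming libhdfs and the usual Hadoop environment are available; the host, port, user and path are placeholders:

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    # Stand-in for the large frame from the question.
    pandas_dataframe = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})
    table = pa.Table.from_pandas(pandas_dataframe)

    # Legacy pyarrow HDFS client; needs libhdfs on the machine running this.
    fs = pa.hdfs.connect(host='namenode', port=8020, user='hdfs')

    with fs.open('/user/hdfs/output/data.parquet', 'wb') as f:
        pq.write_table(table, f, compression='snappy')

For frames too large to convert in one piece, the same open file handle can be combined with pq.ParquetWriter to write the data chunk by chunk.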
