parquet

Force dask to_parquet to write single file

Submitted by 别等时光非礼了梦想 on 2020-06-17 10:02:19
Question: When using dask.to_parquet(df, filename), a subfolder named filename is created and several files are written to that folder, whereas pandas.to_parquet(df, filename) writes exactly one file. Can I use Dask's to_parquet (without using compute() to create a pandas DataFrame) to write just a single file? Answer 1: Writing to a single file is very hard within a parallel system. Sorry, such an option is not offered by Dask (nor probably by any other parallel processing library). You could in theory perform the…
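The answer is truncated above, but the usual workarounds are either to collapse the Dask DataFrame to one partition (still a directory, just containing a single part file) or to materialize it with compute() and let pandas write a true single file. A minimal sketch, with hypothetical data and output paths:

```python
import pandas as pd
import dask.dataframe as dd

# Hypothetical data standing in for the original Dask DataFrame.
ddf = dd.from_pandas(pd.DataFrame({"x": range(10)}), npartitions=4)

# Option A: one partition -> the output directory contains a single part file.
ddf.repartition(npartitions=1).to_parquet("out_dir")

# Option B: materialize to pandas (this does call compute()) and write one real file.
ddf.compute().to_parquet("single_file.parquet")
```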

Read parquet data from ByteArrayOutputStream instead of file

Submitted by 隐身守侯 on 2020-06-12 10:08:20
Question: I would like to convert this code: import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.fs.Path; import org.apache.parquet.column.page.PageReadStore; import org.apache.parquet.example.data.simple.SimpleGroup; import org.apache.parquet.example.data.simple.convert.GroupRecordConverter; import org.apache.parquet.hadoop.ParquetFileReader; import org.apache.parquet.hadoop.util.HadoopInputFile; import org.apache.parquet.io.ColumnIOFactory; import org.apache.parquet.io…
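The question targets Java's parquet-mr (ParquetFileReader over a HadoopInputFile) and the code above is truncated, but for illustration here is a hedged Python/pyarrow sketch of the same idea: reading Parquet from an in-memory buffer instead of a file. The data and the write-then-read round trip are hypothetical:

```python
import io
import pandas as pd
import pyarrow.parquet as pq

# Hypothetical round trip: produce parquet bytes in memory, then read them back
# without touching the file system.
df = pd.DataFrame({"a": [1, 2, 3]})
buf = io.BytesIO()
df.to_parquet(buf, engine="pyarrow")

# pq.read_table accepts file-like objects, so the bytes can be wrapped directly.
table = pq.read_table(io.BytesIO(buf.getvalue()))
print(table.to_pandas())
```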

hive table gives error Unimplemented type

Submitted by 魔方 西西 on 2020-05-16 22:36:55
Question: Using spark-sql-2.4.1, I am writing a parquet file with a schema containing |-- avg: double (nullable = true). When reading the same data using val df = spark.read.format("parquet").load(); I get the error: UnsupportedOperationException: Unimplemented type: DoubleType. So what is wrong here, and how can it be fixed? Stack trace: Caused by: java.lang.UnsupportedOperationException: Unimplemented type: DoubleType at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readIntBatch…
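The readIntBatch frame in the stack trace suggests the vectorized Parquet reader is decoding a column whose physical type in the file does not match the double declared in the table schema. A common workaround, sketched below under that assumption, is to fall back to the non-vectorized reader (the more robust fix is to make the declared column type match what was actually written); the path is hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("double-read-workaround").getOrCreate()

# Fall back to the row-based (non-vectorized) Parquet reader; the vectorized
# reader raises "Unimplemented type" when the declared schema and the file's
# physical type disagree.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")

df = spark.read.format("parquet").load("/path/to/parquet")  # hypothetical path
df.printSchema()
```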

Pandas to parquet NOT into file-system but get content of resulting file in variable

Submitted by 泄露秘密 on 2020-05-15 09:04:52
Question: There are several ways to convert from pandas to parquet, e.g. pyarrow.Table.from_pandas or dataframe.to_parquet. What they have in common is that they take as a parameter a file path where the parquet output should be stored. I need to get the content of the written parquet file into a variable and have not seen this yet. Mainly I want the same behavior as pandas.to_csv, which returns the result as a string if no path is provided. Of course I could just write the file and read…
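A sketch of one way to do this with an in-memory buffer, assuming a reasonably recent pandas with the pyarrow engine (the DataFrame is hypothetical):

```python
import io
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})  # hypothetical data

# Write the parquet file into an in-memory buffer instead of the file system.
buf = io.BytesIO()
df.to_parquet(buf, engine="pyarrow")

parquet_bytes = buf.getvalue()  # the full file content as a bytes object
print(len(parquet_bytes))
```

The same can be done at the pyarrow level with pyarrow.BufferOutputStream and pyarrow.parquet.write_table if you want to bypass the pandas wrapper.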

PySpark Write Parquet Binary Column with Stats (signed-min-max.enabled)

Submitted by 房东的猫 on 2020-05-13 14:14:33
Question: I found this apache-parquet ticket https://issues.apache.org/jira/browse/PARQUET-686, which is marked as resolved for parquet-mr 1.8.2. The feature I want is the calculated min/max in the parquet metadata for a (string or BINARY) column. Referencing this is an email https://lists.apache.org/thread.html/%3CCANPCBc2UPm+oZFfP9oT8gPKh_v0_BF0jVEuf=Q3d-5=ugxSFbQ@mail.gmail.com%3E which uses Scala instead of pyspark as an example: Configuration conf = new Configuration(); + conf.set("parquet…
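The snippet above is truncated, but one way to pass such a Hadoop/parquet-mr property from PySpark is the spark.hadoop.* config prefix, which forwards the setting to the underlying Hadoop Configuration. A hedged sketch, assuming the property name from PARQUET-686 (parquet.strings.signed-min-max.enabled) and a parquet-mr version that honors it; the DataFrame and output path are hypothetical:

```python
from pyspark.sql import SparkSession

# The spark.hadoop.* prefix forwards the property to the Hadoop Configuration
# used by the Parquet writer; the property name comes from PARQUET-686.
spark = (
    SparkSession.builder
    .appName("signed-min-max-example")
    .config("spark.hadoop.parquet.strings.signed-min-max.enabled", "true")
    .getOrCreate()
)

df = spark.createDataFrame([("alpha",), ("beta",)], ["value"])  # hypothetical data
df.write.mode("overwrite").parquet("/tmp/signed_min_max_parquet")  # hypothetical path
```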

Transfer and write Parquet with python and pandas got timestamp error

Submitted by 时光怂恿深爱的人放手 on 2020-05-11 05:14:05
Question: I tried to concat() two parquet files with pandas in Python. It works, but when I try to write and save the DataFrame to a parquet file, it displays the error: ArrowInvalid: Casting from timestamp[ns] to timestamp[ms] would lose data. I checked the pandas docs; by default it uses ms timestamp precision when writing the parquet file. How can I write the parquet file with the original schema after the concat? Here is my code: import pandas as pd table1 = pd.read_parquet(path= ('path.parquet'),engine=…
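A hedged sketch of the usual fix, assuming the pyarrow engine: explicitly allow the nanosecond-to-millisecond cast instead of letting it raise ArrowInvalid. The file names below are hypothetical, since the originals are truncated in the question:

```python
import pandas as pd

# Hypothetical inputs; the real paths are truncated in the question.
table1 = pd.read_parquet("part1.parquet", engine="pyarrow")
table2 = pd.read_parquet("part2.parquet", engine="pyarrow")
combined = pd.concat([table1, table2], ignore_index=True)

# Extra keyword arguments are passed through to pyarrow.parquet.write_table:
# coerce timestamps to ms and permit the (lossy) truncation rather than raising.
combined.to_parquet(
    "combined.parquet",
    engine="pyarrow",
    coerce_timestamps="ms",
    allow_truncated_timestamps=True,
)
```

Alternatively, newer pyarrow releases can keep nanosecond precision by writing a newer Parquet format version (the version argument of pyarrow.parquet.write_table), if downstream readers support it.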

A new Flink scenario: OLAP engine performance optimization and application case studies

Submitted by 那年仲夏 on 2020-05-01 12:02:34
Abstract: This article is shared by Alibaba technical expert 贺小令 (晓令) and introduces the new Apache Flink scenario of an OLAP engine. It is organized in four parts: background, the Flink OLAP engine, case studies, and future plans. I. Background 1. OLAP and its categories OLAP is a computing approach that lets users analyze data conveniently and quickly from different perspectives. Mainstream OLAP falls into three categories: multidimensional OLAP (MOLAP), relational OLAP (ROLAP), and hybrid OLAP (HOLAP). (1) Multidimensional OLAP (MOLAP): the traditional OLAP analysis approach; data is stored in multidimensional cubes. (2) Relational OLAP (ROLAP): centered on a relational database, representing multidimensional data with relational structures; the slice and dice operations of traditional OLAP are expressed through SQL WHERE conditions. (3) Hybrid OLAP (HOLAP): combines the strengths of MOLAP and ROLAP to obtain better performance. Each category's characteristics are described in detail below. ■ Multidimensional OLAP (MOLAP): typical representatives are Kylin and Druid. MOLAP processing flow: first, the raw data is pre-processed; then the pre-processed data is stored in a data warehouse, and user requests query the data warehouse through the OLAP server. Advantages and disadvantages of MOLAP…