parquet

Apache Hudi Major Feature Explained: The Global Index

烈酒焚心 Submitted on 2020-07-27 15:18:01
1. Abstract Hudi tables support many kinds of operations, including the very commonly used upsert; to support upsert, Hudi relies on an index mechanism to locate which files a record lives in. Currently, Hudi supports both partitioned and non-partitioned datasets. A partitioned dataset is one in which groups of files (data) are placed into buckets called partitions. A Hudi dataset may consist of N partitions and M files, an organization that also makes it easy for engines such as Hive/Presto/Spark to filter on the partition field and return a bounded amount of data. In the vast majority of cases the partition value is derived from the data itself, which requires that once a record is mapped to a partition/bucket, that mapping a) is known to Hudi and b) stays unchanged for the lifetime of the Hudi dataset. On a non-partitioned dataset, Hudi needs to know the recordKey -> fileId mapping in order to upsert records; the existing solutions are: a) the user/client supplies the correct partition value via the payload; b) a GlobalBloomIndex implementation scans all files under the given path. In both scenarios, either the user has to provide the mapping information or the system pays the performance cost of scanning every file. This proposal introduces a new index type that maintains a (recordKey <-> partition, fileId) mapping, or a ((recordKey, partitionPath) → fileId) mapping, stored and maintained by Hudi itself, which removes both of the limitations above. 2. Background Dataset types
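This kind of key-to-file mapping is what Hudi already exposes today through its global index types. As a rough sketch (not part of the proposal itself, with a made-up table path and field names), the PySpark snippet below configures an upsert with hoodie.index.type set to GLOBAL_BLOOM, which matches records by key across all partitions:

```python
# Hypothetical sketch: upsert into a Hudi table with a global index so that
# records are matched by recordKey across all partitions, not per partition.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hudi-global-index-demo")
         .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
         .getOrCreate())

updates = spark.createDataFrame(
    [("id-1", "2020-07-27", 42)], ["record_key", "partition_col", "value"])

hudi_options = {
    "hoodie.table.name": "demo_table",                       # example table name
    "hoodie.datasource.write.recordkey.field": "record_key",
    "hoodie.datasource.write.partitionpath.field": "partition_col",
    "hoodie.datasource.write.operation": "upsert",
    # GLOBAL_BLOOM enforces key uniqueness across partitions instead of per partition.
    "hoodie.index.type": "GLOBAL_BLOOM",
}

(updates.write.format("hudi")
    .options(**hudi_options)
    .mode("append")                        # append mode performs the upsert
    .save("/tmp/hudi/demo_table"))         # example base path
```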

How to save a Dask DataFrame to parquet on the same machine as the Dask scheduler/workers?

為{幸葍}努か Submitted on 2020-07-22 08:03:17
Question: I'm trying to save my Dask DataFrame to parquet on the same machine as the Dask scheduler/workers are located, but I am having trouble doing this. My Dask setup: my Python script is executed on my local machine (a laptop with 16 GB RAM), but the script creates a Dask client to a Dask scheduler running on a remote machine (a server with 400 GB RAM for parallel computations). The Dask scheduler and workers are all located on the same server, thus they all share the same file system, locally
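A point worth keeping in mind for this setup (sketched below with a placeholder scheduler address and placeholder paths, not necessarily the asker's final solution): the to_parquet tasks execute on the workers, so the output path has to be valid on the server's shared file system rather than on the laptop:

```python
# Minimal sketch: connect to a remote Dask scheduler and write parquet to a path
# that exists on the workers' (server-side) file system.
import dask.dataframe as dd
from dask.distributed import Client

client = Client("tcp://my-dask-server:8786")      # placeholder scheduler address

df = dd.read_csv("/data/on/server/input-*.csv")   # placeholder path visible to the workers

# to_parquet runs on the workers, so "/data/on/server/output" must be a path on
# the server's file system, not on the local laptop.
df.to_parquet("/data/on/server/output", engine="pyarrow", write_index=False)
```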

UPSERT in parquet with PySpark

假如想象 Submitted on 2020-07-19 01:59:52
Question: I have parquet files in S3 with the following partitions: year / month / date / some_id. Using Spark (PySpark), each day I would like to UPSERT the last 14 days: I want to replace the existing data in S3 (one parquet file for each partition), but not delete the days that are older than 14 days. I tried two save modes: append, which wasn't good because it just adds another file; and overwrite, which deletes the past data and the data for other partitions. Is there any way or best practice to
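One commonly used approach for this pattern, sketched below under the assumption of Spark 2.3+ and with placeholder S3 paths, is dynamic partition overwrite, which replaces only the partitions present in the incoming DataFrame and leaves older partitions untouched:

```python
# Sketch: overwrite only the partitions contained in the last-14-days DataFrame,
# leaving older partitions in S3 as they are (requires Spark 2.3+).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("upsert-last-14-days").getOrCreate()

# Only partitions that appear in the written DataFrame get replaced.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

last_14_days = spark.read.parquet("s3a://my-bucket/staging/last_14_days/")  # placeholder input

(last_14_days.write
    .mode("overwrite")
    .partitionBy("year", "month", "date", "some_id")
    .parquet("s3a://my-bucket/table/"))                                     # placeholder target
```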

How to read and write to the same file in Spark using parquet?

瘦欲@ Submitted on 2020-07-06 15:02:01
Question: I am trying to read from a parquet file in Spark, do a union with another RDD, and then write the result into the same file I read from (basically overwrite). This throws the following error: couldnt write parquet to file: An error occurred while calling o102.parquet. : org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree: TungstenExchange hashpartitioning(billID#42,200), None +- Union :- Scan ParquetRelation[units#35,price#36,priceSold#37,orderingTime#38,itemID
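The usual workaround, sketched below with placeholder paths, is to avoid writing directly over a path that is still being read lazily: materialize the union somewhere else first, then overwrite the original location from the materialized copy:

```python
# Sketch: avoid overwriting a parquet path that Spark is still lazily reading by
# writing the result to a temporary location first, then swapping it into place.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("safe-overwrite").getOrCreate()

path = "/data/bills"          # placeholder path of the original parquet data
tmp_path = "/data/bills_tmp"  # placeholder temporary path

existing = spark.read.parquet(path)
updates = spark.read.parquet("/data/bills_updates")  # placeholder second dataset (same schema)

result = existing.union(updates)

# Writing back to `path` directly would truncate the files the union is still
# lazily reading from, so write elsewhere first.
result.write.mode("overwrite").parquet(tmp_path)

# Re-read the materialized result and overwrite the original location.
spark.read.parquet(tmp_path).write.mode("overwrite").parquet(path)
```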

Preserve DataFrame partitioning when writing to and re-reading from a parquet file

你离开我真会死。 Submitted on 2020-06-25 21:23:15
Question: When I write a DataFrame with a defined partitioning to disk as a parquet file and then re-read the parquet file, the partitioning is lost. Is there a way to preserve the original partitioning of the DataFrame across writing and re-reading? Example code: //create a dataframe with 100 partitions and print the number of partitions val originalDf = spark.sparkContext.parallelize(1 to 10000).toDF().repartition(100) println("partitions before writing to disk: " + originalDf.rdd.partitions
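For reference, parquet does not store the DataFrame's in-memory partition count, so it cannot be recovered on read. The sketch below (PySpark rather than the Scala of the question, with placeholder paths and a hypothetical table name) shows the two usual workarounds: repartitioning again after the read, or persisting a bucketed table so the layout is recorded in the catalog:

```python
# Sketch: the in-memory partition count is not part of the parquet files, so
# either repartition after reading, or persist a bucketed table whose bucketing
# spec the catalog remembers.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("preserve-partitioning").getOrCreate()

original_df = spark.range(1, 10001).repartition(100)
print("partitions before writing:", original_df.rdd.getNumPartitions())   # 100

# Option 1: plain parquet round trip, then repartition again after reading.
original_df.write.mode("overwrite").parquet("/tmp/demo_parquet")          # placeholder path
reread = spark.read.parquet("/tmp/demo_parquet").repartition(100)
print("partitions after re-read + repartition:", reread.rdd.getNumPartitions())

# Option 2: bucket by the key you care about; the bucketing spec is kept in the
# catalog, which lets later reads of the table avoid a shuffle for joins or
# aggregations on that key (it does not restore the exact RDD partition count).
(original_df.write
    .mode("overwrite")
    .bucketBy(100, "id")
    .sortBy("id")
    .saveAsTable("demo_bucketed"))                                         # placeholder table name
```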
