presto

Apache Hudi Major Feature Explained: Global Index

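烈酒焚心 posted on 2020-07-27 15:18:01
1. Abstract Hudi tables support several types of operations, including the very commonly used upsert. To support upsert, Hudi relies on an indexing mechanism to locate the files that contain a given record. Currently, Hudi supports both partitioned and non-partitioned datasets. A partitioned dataset places groups of files (data) into buckets called partitions. A Hudi dataset may consist of N partitions and M files, and this layout makes it convenient for engines such as hive/presto/spark to filter on partition fields and return a limited amount of data. In most cases the partition value is derived from the data itself, which requires that once a record is mapped to a partition/bucket, the mapping is a) known to Hudi; b) unchanged for the lifetime of the Hudi dataset. On a non-partitioned dataset, Hudi needs to know the recordKey -> fileId mapping in order to upsert records. The existing solutions are: a) the user/client supplies the correct partition value through the payload; b) a GlobalBloomIndex implementation scans all files under the specified path. In both cases, either the user must supply the mapping information or the scan of all files incurs a performance cost. This proposal aims to implement a new index type that maintains a (recordKey <-> partition, fileId) mapping or a ((recordKey, partitionPath) → fileId) mapping. The mapping is stored and maintained by Hudi, and it addresses both limitations described above. 2. Background Dataset types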
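
The (recordKey -> partition, fileId) mapping described in the abstract above can be pictured with a small in-memory sketch. This is only an illustration of the idea, not Hudi's implementation: the class `GlobalIndexSketch`, the type `RecordLocation`, and the sample keys and file names are all hypothetical.

```python
from typing import Dict, NamedTuple, Optional


class RecordLocation(NamedTuple):
    """Where a record lives: hypothetical stand-in for (partition, fileId)."""
    partition_path: str
    file_id: str


class GlobalIndexSketch:
    """Toy in-memory index mapping recordKey -> (partition, fileId)."""

    def __init__(self) -> None:
        self._locations: Dict[str, RecordLocation] = {}

    def lookup(self, record_key: str) -> Optional[RecordLocation]:
        # A single dictionary probe replaces scanning every file.
        return self._locations.get(record_key)

    def upsert(self, record_key: str, partition_path: str, new_file_id: str) -> RecordLocation:
        existing = self._locations.get(record_key)
        if existing is not None:
            # Known key: reuse the stored location so the mapping stays
            # unchanged for the lifetime of the dataset.
            return existing
        location = RecordLocation(partition_path, new_file_id)
        self._locations[record_key] = location
        return location


index = GlobalIndexSketch()
first = index.upsert("key-1", "2020/07/27", "file-a")
# The caller no longer has to supply the right partition for a known key.
second = index.upsert("key-1", "2020/07/28", "file-b")
```

In this sketch the second upsert resolves to the location recorded by the first, which is the property the abstract asks for: the index, rather than the caller, knows the mapping.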

Effective way To use Count and JOIN in PRESTO CASE statements

你说的曾经没有我的故事 posted on 2020-07-23 06:49:43
Question: This follows up on a question raised earlier, "Effective way in PRESTO to Result output with boolean values". Please refer to that question for more details. Two solutions came out of it. SELECT dp.USER_NAME, dp.ID, CASE WHEN dp.sex='F' THEN 'True' ELSE 'False' END AS Rule_1, CASE WHEN dp.sex='M' THEN 'True' ELSE 'False' END AS Rule_2, CASE WHEN dp.sex not in ('M','F') THEN 'True' ELSE 'False' END AS Rule_3 FROM user_details dp where dp.Organisation='007'; SELECT dp.USER_NAME

Does Presto SQL support recursive query using CTE just like SQL Server? e.g. employee hierarchy level

让人想犯罪 __ posted on 2020-07-09 05:10:53
Question: I want to write a recursive query using a CTE in Presto to find an employee hierarchy. Does Presto support recursive queries? When I write a simple recursion such as with cte as(select 1 n union all select cte.n+1 from cte where n<50) select * from cte it fails with: Error running query: line 3:32: Table cte does not exist Answer 1: Presto grammar supports WITH RECURSIVE name AS ..., but recursive WITH queries are not implemented. This is tracked as a feature request: https://github.com/prestosql/presto
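
Since the answer above says recursive WITH queries are not implemented, one possible workaround is to fetch the (employee, manager) pairs with a plain query and compute the hierarchy levels on the client. The sketch below is an assumption rather than anything from the original thread: the function `hierarchy_levels` and the sample data are hypothetical.

```python
from typing import Dict, Optional


def hierarchy_levels(manager_of: Dict[str, Optional[str]]) -> Dict[str, int]:
    """Assign level 1 to employees without a manager, level 2 to their reports, and so on."""
    levels: Dict[str, int] = {}
    # Start from the roots: employees whose manager is None.
    frontier = [emp for emp, mgr in manager_of.items() if mgr is None]
    level = 1
    while frontier:
        for emp in frontier:
            levels[emp] = level
        # Next level: everyone who reports to someone in the current level.
        frontier = [
            emp
            for emp, mgr in manager_of.items()
            if mgr in frontier and emp not in levels
        ]
        level += 1
    return levels


# Hypothetical rows, as they might come back from SELECT employee, manager FROM ...
sample = {"ceo": None, "vp": "ceo", "ops": "ceo", "eng": "vp"}
levels = hierarchy_levels(sample)
```

Each pass of the loop plays the role of one recursion step of the CTE. Employees that are not reachable from a root (for example, those caught in a reporting cycle) simply receive no level.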

Presto coordinator returning 404 error when connecting through Teradata ODBC driver

半腔热情 posted on 2020-06-29 08:58:16
Question: I am attempting to connect to a Presto coordinator that resides on an EMR cluster. I am using the Teradata ODBC driver. I have tested the driver both by entering the pertinent details into the DSN via the ODBC connections dialog and by writing a simple C# application that creates a connection (see the code below). The problem is that a 404 error is returned whether the connection is tested in the DSN dialog or opened in the C# code. I believe the security group settings in AWS are fine

NOT IN implementation of Presto vs. Spark SQL

我们两清 posted on 2020-06-25 10:51:33
Question: I have a very simple query that shows a significant performance difference between Spark SQL and Presto (3 hrs vs. 3 mins) on the same hardware. SELECT field FROM test1 WHERE field NOT IN (SELECT field FROM test2) After studying the query plan, I found that the cause is how Spark SQL handles the NOT IN predicate subquery. To handle NULLs in NOT IN correctly, Spark SQL translates the NOT IN predicate into Left AntiJoin( (test1=test2) OR isNULL(test1=test2)) . Spark SQL introduces
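
The NULL handling that the question above refers to comes from SQL's three-valued logic. The sketch below models it in Python, with `None` standing in for both NULL and UNKNOWN. It illustrates the semantics only, not how either engine evaluates the predicate, and the function `sql_not_in` and the sample values are hypothetical.

```python
from typing import Iterable, List, Optional


def sql_not_in(value: Optional[int], candidates: Iterable[Optional[int]]) -> Optional[bool]:
    """Evaluate `value NOT IN (candidates)`; None models SQL UNKNOWN."""
    items: List[Optional[int]] = list(candidates)
    if not items:
        return True   # NOT IN over an empty set is TRUE, even for NULL
    if value is None:
        return None   # NULL compared with anything is UNKNOWN
    if value in items:
        return False  # a definite match
    if any(item is None for item in items):
        return None   # no match, but a NULL might have matched
    return True


test1 = [1, 2, None]
# WHERE keeps a row only when the predicate is TRUE, not UNKNOWN.
kept_without_null = [v for v in test1 if sql_not_in(v, [2]) is True]
kept_with_null = [v for v in test1 if sql_not_in(v, [2, None]) is True]
```

A single NULL on the subquery side turns every non-matching row into UNKNOWN, so the query returns nothing. That is why a plain anti-join on equality is not sufficient and the join condition in the question carries the extra isNULL branch.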

Getting all Buildings in range of 5 miles from specified coordinates

纵饮孤独 posted on 2020-06-22 03:54:41
Question: I have a database table Building with these columns: name , lat , lng How can I get all Buildings within 5 miles of specified coordinates, for example these: -84.38653999999998 33.72024 My attempt, which does not work: SELECT ST_CONTAINS( SELECT ST_BUFFER(ST_Point(-84.38653999999998,33.72024), 5), SELECT ST_POINT(lat,lng) FROM "my_db"."Building" LIMIT 50 ); https://docs.aws.amazon.com/athena/latest/ug/geospatial-functions-list.html Answer 1: Why are you storing x,y in separate columns? I would
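
To make the "within 5 miles" requirement in the question above concrete, here is a minimal sketch of the distance check using the haversine great-circle formula. It is a swapped-in illustration, not the Athena geospatial approach the thread is about. It assumes the example coordinates are given as longitude then latitude, and the sample rows, the helper names, and the Earth radius constant are assumptions.

```python
import math
from typing import Iterable, List, Tuple

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius, an approximation


def haversine_miles(lat1: float, lng1: float, lat2: float, lng2: float) -> float:
    """Great-circle distance in miles between two (lat, lng) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def buildings_within(
    buildings: Iterable[Tuple[str, float, float]],
    center_lat: float,
    center_lng: float,
    radius_miles: float = 5.0,
) -> List[str]:
    """Return names of (name, lat, lng) rows within radius_miles of the center."""
    return [
        name
        for name, lat, lng in buildings
        if haversine_miles(center_lat, center_lng, lat, lng) <= radius_miles
    ]


# Hypothetical rows shaped like the Building table: (name, lat, lng).
sample_buildings = [("near", 33.75, -84.39), ("far", 34.0, -84.39)]
nearby = buildings_within(sample_buildings, 33.72024, -84.38653999999998)
```

The same check can be written as a WHERE clause that applies the formula to the lat and lng columns, which avoids building buffer geometries altogether.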

Unable to convert varchar to array in Presto Athena

╄→гoц情女王★ posted on 2020-06-12 08:59:11
Question: My data is in varchar format. I want to split both elements of this array so that I can then extract a key's value from the JSON. Data format: [ { "skuId": "5bc87ae20d298a283c297ca1", "unitPrice": 0, "id": "5bc87ae20d298a283c297ca1", "quantity": "1" }, { "skuId": "182784738484wefhdchs4848", "unitPrice": 50, "id": "5bc87ae20d298a283c297ca1", "quantity": "4" }, ] For example, I want to extract skuId from the above column, so my data after extraction should look like: 1 5bc87ae20d298a283c297ca1 2
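
The extraction the question above asks for can be sketched outside the engine as follows. The sample string as posted ends with a trailing comma before the closing bracket, which strict JSON parsers reject, so this sketch strips it first. That cleanup step and the function name `extract_sku_ids` are assumptions for illustration.

```python
import json
import re
from typing import List

# The varchar value as shown in the question, including its trailing comma.
raw = (
    '[ { "skuId": "5bc87ae20d298a283c297ca1", "unitPrice": 0, '
    '"id": "5bc87ae20d298a283c297ca1", "quantity": "1" }, '
    '{ "skuId": "182784738484wefhdchs4848", "unitPrice": 50, '
    '"id": "5bc87ae20d298a283c297ca1", "quantity": "4" }, ]'
)


def extract_sku_ids(text: str) -> List[str]:
    """Parse a JSON array stored as text and return each element's skuId."""
    # Drop a trailing comma before the closing bracket so json.loads accepts it.
    cleaned = re.sub(r",\s*\]\s*$", "]", text.strip())
    return [item["skuId"] for item in json.loads(cleaned)]


sku_ids = extract_sku_ids(raw)
```

If the stored values really contain the trailing comma, a similar cleanup would likely be needed before any in-engine JSON parsing as well.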