presto | 易学教程

Integrating Presto with CDH

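Basic requirements for Presto:

- Linux or Mac OS X
- Java 8, 64-bit (update 151 or later)
- Python 2.4+

I. Install Presto

Download link: Download Presto

1. Upload the archive and extract it to ${CM}/cloudera/parcels

    tar -zxvf presto-server-0.216.tar.gz -C /opt/cloudera/parcels/

2. Create a symbolic link for Presto

    # Create the symbolic link
    sudo ln -s presto-server-0.228 PRESTO
    # Change ownership
    sudo chown cloudera-scm:cloudera-scm PRESTO presto-server-0.228

3. Point Presto at a JDK

    sudo vim ${PRESTO_HOME}/bin/launcher
    # Add:
    export JAVA_HOME=/usr/java/jdk1.8
    export PATH=$PATH:$JAVA_HOME/bin

4. Create the configuration files. In the Presto root directory, create an etc folder, then create the configuration files inside it

    mkdir -p etc

4.1. Create node.properties (node property configuration)

A Presto cluster has two kinds of nodes:

coordinator: the master node, which accepts client connections and dispatches and executes tasks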
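The excerpt stops before showing the contents of node.properties. As a rough sketch only, the three properties that Presto's deployment documentation lists for this file (node.environment, node.id, node.data-dir) could be rendered as below. The helper name render_node_properties, the environment name, the data directory, and the fixed node id are all hypothetical values chosen for illustration, not taken from the original article.

```python
import uuid


def render_node_properties(environment, data_dir, node_id=None):
    """Return the text of a minimal node.properties file.

    node_id must be unique per node and stable across restarts;
    a random UUID is generated when none is supplied.
    """
    node_id = node_id or str(uuid.uuid4())
    lines = [
        f"node.environment={environment}",  # same value on every node of the cluster
        f"node.id={node_id}",               # unique identifier for this node
        f"node.data-dir={data_dir}",        # where Presto keeps logs and other data
    ]
    return "\n".join(lines) + "\n"


# Hypothetical values, for illustration only.
text = render_node_properties(
    "production",
    "/var/presto/data",
    node_id="ffffffff-ffff-ffff-ffff-ffffffffffff",
)
print(text, end="")
```

The resulting text would be saved as etc/node.properties under the Presto root directory created in step 4.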

Presto SQL window aggregate looking back x hours/minutes/seconds

AWS Athena (Presto) OFFSET support

Question: I would like to know whether AWS Athena supports OFFSET. The following query runs in MySQL, but Athena returns an error. Any example would be helpful.

    select * from employee where empSal > 3000 LIMIT 300 OFFSET 20

Answer 1: Athena is essentially managed Presto. Since Presto 311 you can use the OFFSET m LIMIT n syntax or the ANSI SQL equivalent, OFFSET m ROWS FETCH NEXT n ROWS ONLY. For older versions (including AWS Athena as of this writing), you can use row_number() window
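The row_number() approach the answer starts to describe can be sketched as follows. This is an illustration under stated assumptions, not the answer's original code: it runs the query against an in-memory SQLite database (3.25 or later, for window function support) as a stand-in for Athena, and the empId column, the sample rows, and the helper name page_with_row_number are hypothetical.

```python
import sqlite3


def page_with_row_number(conn, offset, limit):
    """Emulate OFFSET/LIMIT by numbering rows and filtering on the number."""
    sql = """
        SELECT empId, empSal
        FROM (
            SELECT empId, empSal,
                   row_number() OVER (ORDER BY empId) AS rn
            FROM employee
            WHERE empSal > 3000
        ) AS numbered
        WHERE rn > ? AND rn <= ?   -- skip `offset` rows, keep the next `limit`
        ORDER BY rn
    """
    return conn.execute(sql, (offset, offset + limit)).fetchall()


# SQLite stands in for Athena here; empId and the data are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (empId INTEGER, empSal INTEGER)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?)",
    [(i, 3000 + i) for i in range(1, 11)],
)

rows = page_with_row_number(conn, offset=2, limit=3)
print(rows)
```

The explicit ORDER BY inside the window matters: without a deterministic ordering, the set of rows skipped by the offset is not well defined.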

How to Quickly Flatten a SQL Table

Question: I'm using Presto. If I have a table like:

ID  CATEGORY  VALUE
1   a         ...
1   b
1   c
2   a
2   b
3   b
3   d
3   e
3   f

How would you convert it to the layout below without writing a case statement for each combination?

ID  A  B  C  D  E  F
1
2
3

Answer 1: I've never used Presto and the documentation seems pretty thin, but based on this article it looks like you could do

    SELECT id,
           kv['A'] AS A,
           kv['B'] AS B,
           kv['C'] AS C,
           kv['D'] AS D,
           kv['E'] AS E,
           kv['F'] AS F
    FROM (
        SELECT id, map_agg(category, value) kv
        FROM vtable
        GROUP BY id
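To make the idea behind map_agg concrete, here is a small pure-Python sketch of the same two steps: collapse each id's rows into a key/value map, then read one column per category out of that map. The function name pivot and the numeric values are hypothetical, since the question does not show the VALUE column. Missing categories come back as None, which mirrors what Presto's element_at(kv, key) returns; the subscript form kv['x'] raises an error in Presto when the key is absent.

```python
from collections import defaultdict


def pivot(rows, categories):
    """Turn (id, category, value) rows into one wide row per id."""
    kv_by_id = defaultdict(dict)
    for row_id, category, value in rows:
        # Plays the role of map_agg(category, value) ... GROUP BY id
        kv_by_id[row_id][category] = value
    return [
        # dict.get yields None for a category this id never had
        (row_id, *[kv.get(category) for category in categories])
        for row_id, kv in sorted(kv_by_id.items())
    ]


# Made-up values; only the id/category pattern follows the question.
rows = [
    (1, "a", 10), (1, "b", 20),
    (2, "a", 30),
    (3, "b", 40), (3, "d", 50),
]
flat = pivot(rows, ["a", "b", "c", "d"])
for line in flat:
    print(line)
```

Map keys are case-sensitive, so the keys used for lookup have to match the stored category values exactly.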

What are the fundamental architectural, SQL compliance, and data use scenario differences between Presto and Impala?

Question: Can an expert give a succinct account of the differences between Presto and Impala from these perspectives?

- Fundamental architecture design
- SQL compliance
- Real-world latency
- Any SPOF or fault-tolerance functionality
- Structured and unstructured data use scenario performance

Source: https://stackoverflow.com/questions/19841027/what-are-the-fundamental-architectural-sql-compliance-and-data-use-scenario-di

how to use presto to query hive data

Question: I just installed Presto, and when I use presto-cli to query Hive data I get the following error:

    $ ./presto --server node6:8080 --catalog hive --schema default
    presto:default> show tables;
    Query 20131113_150006_00002_u8uyp failed: Table hive.information_schema.tables does not exist

The config.properties is:

    coordinator=true
    datasources=jmx,hive
    http-server.http.port=8080
    presto-metastore.db.type=h2
    presto-metastore.db.filename=/root/h2
    task.max-memory=1GB
    discovery-server.enabled=true

What does MSCK REPAIR TABLE do behind the scenes and why it's so slow?

Question: I know that MSCK REPAIR TABLE updates the metastore with the current partitions of an external table. To do that, it should only need to run ls on the table's root folder (assuming the table is partitioned by a single column) and collect all its partitions, which is clearly a sub-second operation. In practice, however, the operation can take a very long time to execute (or even time out when run on AWS Athena). So my question is: what does MSCK REPAIR TABLE actually do behind the scenes, and why? How does MSCK REPAIR TABLE

How to group time column into 5 second intervals and count rows using Presto?

Question: I am using Presto and Zeppelin. I have a large amount of raw data that I need to summarize, and I want to group the rows into 5-second intervals.

serviceType  logType  date
------------------------------------------------------
service1     log1     2017-10-24 23:00:23.206
service1     log1     2017-10-24 23:00:23.207
service1     log1     2017-10-24 23:00:25.206
service2     log1     2017-10-24 23:00:24.206
service1     log2     2017-10-24 23:00:27.206
service1     log2     2017-10-24 23:00:29.302

The expected result:

serviceType  logType  date  cnt
-------------
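The bucketing arithmetic behind this kind of grouping can be sketched in plain Python: round each timestamp down to the nearest multiple of 5 seconds, then count rows per (serviceType, logType, bucket). The helper name floor_to_interval is hypothetical, and the sketch is only an illustration of the arithmetic, not a Presto query. In Presto, one commonly used equivalent expression is from_unixtime(floor(to_unixtime(date) / 5) * 5), offered here as an untested assumption about the poster's setup.

```python
from collections import Counter
from datetime import datetime, timedelta


def floor_to_interval(ts, seconds=5):
    """Round a timestamp down to the start of its N-second bucket.

    Buckets are aligned to midnight, which matches epoch alignment
    whenever `seconds` divides 86400 evenly (as 5 does).
    """
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    elapsed = int((ts - midnight).total_seconds())  # whole seconds since midnight
    return midnight + timedelta(seconds=elapsed // seconds * seconds)


# Sample rows copied from the question.
rows = [
    ("service1", "log1", "2017-10-24 23:00:23.206"),
    ("service1", "log1", "2017-10-24 23:00:23.207"),
    ("service1", "log1", "2017-10-24 23:00:25.206"),
    ("service2", "log1", "2017-10-24 23:00:24.206"),
    ("service1", "log2", "2017-10-24 23:00:27.206"),
    ("service1", "log2", "2017-10-24 23:00:29.302"),
]

# Count rows per (serviceType, logType, 5-second bucket).
counts = Counter(
    (service, log, floor_to_interval(datetime.strptime(ts, "%Y-%m-%d %H:%M:%S.%f")))
    for service, log, ts in rows
)

for (service, log, bucket), cnt in sorted(counts.items()):
    print(service, log, bucket, cnt)
```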

Spark incremental loading overwrite old record

Question: I need to load data incrementally into a table using Spark (PySpark). Here is an example:

Day 1

id | value
-----------
1  | abc
2  | def

Day 2

id | value
-----------
2  | cde
3  | xyz

Expected result

id | value
-----------
1  | abc
2  | cde
3  | xyz

This is easy to do in a relational database. Can it be done in Spark or another transformation tool, e.g. Presto?

Answer 1: Here you go! First DataFrame:

    >>> list1 = [(1, 'abc'),(2,'def')]
    >>> olddf = spark
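Since the answer is cut off, the following is not a reconstruction of it. It is a separate sketch of one common way to get the expected result: stack both days' rows, then keep only the newest row per id using row_number(). The query is plain window-function SQL of the kind Presto and Spark SQL also accept, but it is run here against in-memory SQLite (3.25 or later) as a stand-in. The snapshot table and the load_day column are hypothetical additions used to tell the two loads apart.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical staging table: both days' rows plus a load_day marker.
conn.execute("CREATE TABLE snapshot (id INTEGER, value TEXT, load_day INTEGER)")
conn.executemany(
    "INSERT INTO snapshot VALUES (?, ?, ?)",
    [
        (1, "abc", 1), (2, "def", 1),  # Day 1
        (2, "cde", 2), (3, "xyz", 2),  # Day 2
    ],
)

# Rank the rows of each id from newest to oldest load, keep the newest.
merged = conn.execute(
    """
    SELECT id, value
    FROM (
        SELECT id, value,
               row_number() OVER (PARTITION BY id ORDER BY load_day DESC) AS rn
        FROM snapshot
    ) AS ranked
    WHERE rn = 1
    ORDER BY id
    """
).fetchall()

for row in merged:
    print(row)
```

Ids that appear only on Day 1 survive unchanged, ids present on both days take the Day 2 value, and new ids are added, which matches the expected result table in the question.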