Hive | 易学教程

A summary of the Hive metastore database tables


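I sometimes need the Hive metastore database at work, and looking things up on the spot is tedious. The most commonly used tables are listed below for future reference.

Table name | Description | Join keys
DBS | Basic information about every Hive database | DB_ID
TBLS | Basic information about every Hive table | TBL_ID, SD_ID
TABLE_PARAMS | Table properties, such as whether the table is external, the table comment, and some file statistics | TBL_ID
COLUMNS | Hive table column information (column comment, column name, column type, column ordinal) | SD_ID
SDS | The HDFS data directory and data format for every Hive table and table partition | SD_ID, SERDE_ID
SERDE_PARAMS | Serialization and deserialization settings, such as the row delimiter, the column delimiter, and the string that represents NULL | SERDE_ID
PARTITIONS | Hive table partition information | PART_ID, SD_ID, TBL_ID
PARTITION_KEYS | Partition keys of partitioned Hive tables | TBL_ID
PARTITION_KEY_VALS | Hive table partition names (key values) | PART_ID
TBL_PRIVS | Hive privilege information, mostly empty | TBL_GRANT_ID

Heh, I'm being lazy this time; I'll draw a diagram another day.

1. When you need to pull information in bulk for tables that share some characteristic, writing the SQL from scratch is tedious, so I am recording the SQL I used before. It is also useful when wrapping the Hive metastore in an external interface.

# For example: find the table name, column name, column type, comment, and column ordinal; drop the WHERE condition to list all tables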
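The excerpt breaks off before the SQL itself. As a rough illustration of the kind of join it describes, here is a minimal sketch that runs against an in-memory SQLite database standing in for the metastore. The table names and join keys follow the list above; the column names (NAME, TBL_NAME, COLUMN_NAME, TYPE_NAME, COMMENT, INTEGER_IDX), the sample rows, and the helper names `build_demo_metastore` and `list_columns` are assumptions made for the example, not the author's original query. A real metastore usually lives in MySQL or PostgreSQL and its schema varies by Hive version.

```python
import sqlite3

# Simplified stand-in for the metastore schema (assumption): only the
# columns needed for the join are created.
SCHEMA = """
CREATE TABLE DBS (DB_ID INTEGER PRIMARY KEY, NAME TEXT);
CREATE TABLE SDS (SD_ID INTEGER PRIMARY KEY, LOCATION TEXT);
CREATE TABLE TBLS (TBL_ID INTEGER PRIMARY KEY, DB_ID INTEGER,
                   SD_ID INTEGER, TBL_NAME TEXT);
CREATE TABLE COLUMNS (SD_ID INTEGER, COLUMN_NAME TEXT, TYPE_NAME TEXT,
                      COMMENT TEXT, INTEGER_IDX INTEGER);
"""

# Table name, column name, column type, comment, and column ordinal.
# Removing the WHERE clause would list the columns of every table.
COLUMN_QUERY = """
SELECT t.TBL_NAME, c.COLUMN_NAME, c.TYPE_NAME, c.COMMENT, c.INTEGER_IDX
FROM TBLS t
JOIN DBS d ON d.DB_ID = t.DB_ID
JOIN COLUMNS c ON c.SD_ID = t.SD_ID
WHERE d.NAME = ? AND t.TBL_NAME = ?
ORDER BY c.INTEGER_IDX
"""


def build_demo_metastore():
    """Create an in-memory database with one hypothetical table, default.foo."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO DBS VALUES (1, 'default')")
    conn.execute("INSERT INTO SDS VALUES (10, 'hdfs:///warehouse/foo')")
    conn.execute("INSERT INTO TBLS VALUES (100, 1, 10, 'foo')")
    conn.executemany(
        "INSERT INTO COLUMNS VALUES (?, ?, ?, ?, ?)",
        [
            (10, "name", "string", "user name", 1),
            (10, "id", "bigint", "user id", 0),
        ],
    )
    return conn


def list_columns(conn, db_name, table_name):
    """Return (table, column, type, comment, ordinal) rows in column order."""
    return conn.execute(COLUMN_QUERY, (db_name, table_name)).fetchall()


if __name__ == "__main__":
    for row in list_columns(build_demo_metastore(), "default", "foo"):
        print(row)
```

The same SELECT could be pointed at a real metastore connection, provided the column names are checked against the schema of the Hive version in use.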

Hive compared with databases


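Apart from a similar SQL-like query language, Hive and databases have little in common.

1. Query language: Hive's query language is SQL-like.
2. Data storage location: Hive stores its data on HDFS, whereas a database stores its data in local files.
3. Data updates: Hive is designed for data warehousing, where reads far outnumber writes, so data is generally not updated in place. A database routinely performs inserts, deletes, updates, and queries.
4. Indexes: databases have indexes, whereas Hive typically does not use them. For small amounts of data a database has lower latency, but Hive shows its advantage on large amounts of data.
5. Execution: Hive relies on MapReduce for execution, whereas a database relies on its own execution engine.
6. Execution latency: depends on the size of the data.
7. Scalability
8. Data scale

Source: oschina
Link: https://my.oschina.net/u/4434424/blog/4263644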

Hive notes: Fetch Task


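When using Hive, sometimes you only want to look at the first few records of one partition to check the data format. A very common query of this kind is:

select * from foo where partition_column=bar limit 10;

A query like this places almost no demands on the data: any few rows will do. So why not read the stored data directly and return it as the result set?

Hive commands are normally converted into MapReduce jobs for execution. Starting a MapReduce job consumes resources and is slow compared with reading directly from the files, so Hive optimizes such queries: whenever a query can be answered without a MapReduce job, Hive tries to avoid starting one and reads directly from the files instead. My own understanding: fetch task = do not start MapReduce, read the files directly, and output the result.

hive-site.xml has three settings related to fetch task:

hive.fetch.task.conversion
hive.fetch.task.conversion.threshold
hive.fetch.task.aggr

hive.fetch.task.conversion

This property has three possible values:

none: disables the fetch task optimization
minimal: optimizes only statements that use select *, filter on partition columns, and have a limit
more: more powerful than minimal; the select no longer has to be *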
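To make the three modes easier to compare, the following is a toy model of the decision, not Hive's actual optimizer. The function name `uses_fetch_task` and its boolean parameters are hypothetical. The `minimal` branch follows the description above, while the `more` branch assumes the rule given in the Hive configuration documentation (plain SELECT, FILTER, and LIMIT queries), because the excerpt is cut off at that point. The size limit controlled by hive.fetch.task.conversion.threshold is ignored.

```python
VALID_MODES = ("none", "minimal", "more")


def uses_fetch_task(mode, select_star, filters_on_partition_columns_only,
                    has_aggregation_or_join=False):
    """Toy model: would a query skip MapReduce under the given mode?

    mode: value of hive.fetch.task.conversion.
    select_star: True if the query is `select *`.
    filters_on_partition_columns_only: True if every WHERE predicate
        (if any) refers only to partition columns.
    has_aggregation_or_join: True if the query groups, aggregates, or joins.
    """
    if mode not in VALID_MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    if mode == "none" or has_aggregation_or_join:
        return False
    if mode == "minimal":
        # Only `select *` with partition-column filters qualifies.
        return select_star and filters_on_partition_columns_only
    # mode == "more" (assumption): any simple select/filter/limit query.
    return True


if __name__ == "__main__":
    # select * from foo where partition_column=bar limit 10
    print(uses_fetch_task("minimal", True, True))
    # select col_a from foo where col_b > 1 limit 10
    print(uses_fetch_task("minimal", False, False))
    print(uses_fetch_task("more", False, False))
```

Under this model the example query from the text qualifies in both `minimal` and `more`, while a query that projects specific columns qualifies only in `more`.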

Inceptor commands 02: command usage


How to use beeline

1. No authentication

    ./beeline -u jdbc:hive2://{inceptor_server}:10000

2. Kerberos authentication

    kinit -kt /etc/sql2/hive.keytab hive/baogang2@TDH
    klist
    Ticket cache: FILE:/tmp/krb5cc_0
    Default principal: hive/baogang2@TDH

    Valid starting     Expires            Service principal
    11/21/15 15:27:03  11/22/15 01:27:03  krbtgt/TDH@TDH
            renew until 11/22/15 15:27:03

At this point, your identity when connecting to Inceptor is hive. The command for connecting to Inceptor is:

Template:

    beeline -u "jdbc:hive2://<server_ip/hostname>:10000/default;principal=<hive_principal>"

Example:

    beeline -u "jdbc:hive2://baogang2:10000/default;principal=hive/baogang2@TDH"

3. LDAP authentication

You need to connect to Inceptor through LDAP authentication:

    beeline -u "jdbc:hive2:
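The no-authentication and Kerberos connection strings above differ only in the database path and the `principal` parameter. A small sketch of assembling them programmatically follows; the helper name `beeline_jdbc_url` and its defaults are hypothetical, and the LDAP form is left out because the excerpt is cut off before showing it.

```python
def beeline_jdbc_url(host, port=10000, database=None, principal=None):
    """Build a jdbc:hive2 URL in the two forms shown above.

    Without a principal the URL is jdbc:hive2://host:port, with
    /database appended when one is given. With a principal the database
    defaults to "default" and ;principal=... is appended.
    """
    base = f"jdbc:hive2://{host}:{port}"
    if principal is not None:
        return f"{base}/{database or 'default'};principal={principal}"
    if database:
        return f"{base}/{database}"
    return base


if __name__ == "__main__":
    # Kerberos example from the text
    print(beeline_jdbc_url("baogang2", principal="hive/baogang2@TDH"))
```

The resulting string would be passed to `beeline -u` inside double quotes, as in the commands above, so that the shell does not interpret the semicolon.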

[Exception] Cannot construct instance of `com.facebook.presto.jdbc.internal.client.QueryResults`, problem...


1. The exception

    Caused by: com.facebook.presto.jdbc.internal.jackson.databind.exc.InvalidDefinitionException: Cannot construct instance of `com.facebook.presto.jdbc.internal.client.QueryResults`, problem: stats is null

2. Solution

Setting the following session property is enough:

    connection.setSessionProperty("enable_hive_syntax","true");

Source: oschina
Link: https://my.oschina.net/u/4353702/blog/4260621

To schedule a hive query on Crontab


Question

Can anyone help me schedule a job in crontab that executes a simple Hive query at a specific time and writes the output to a text/log file? I have created a batch script to execute a select query, but I get an error ("Hive command not found") when it runs from crontab. The same script runs fine from the shell. Below is my script:

ip.sh

    #!/bin/bash
    echo "Starting of Job"
    cd /home/hadoop/work/hive/bin
    hive -e 'select * from mytest.empl'
    echo "Script ends here"

Crontab: 10

Hive Table getting created but not able to see using hive shell


Question

Hi, I'm saving my DataFrame as a Hive table using spark-sql:

    mydf.write().format("orc").saveAsTable("myTableName")

I can see that the table is being created using

    hadoop fs -ls /apps/hive/warehouse/dbname.db

I can also see the data using spark-shell:

    spark.sql("use dbname")
    spark.sql("show tables").show(false)

but I'm not able to see the same tables using the hive shell. I have placed my hive-site.xml file using

    sudo cp /etc/hive/conf.dist/hive-site.xml /etc/spark/conf/

but I'm still not able to see them. can

Spark notes (13): SparkSQL


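1. Shark

Shark is a SQL execution engine built on the Spark computing framework and compatible with Hive syntax. Because the underlying computation runs on Spark, it is generally more than 2 times faster than Hive on MapReduce, and more than 10 times faster when all the data is loaded in memory, so Shark can be used as an interactive query service. Beyond the features it inherits from Spark, Shark is fully compatible with Hive's syntax, table structures, and UDFs, so existing HiveSQL can be migrated to Shark directly.

Under the hood, Shark depends on Hive's parser and query optimizer. It is precisely because Shark's overall architecture depends so heavily on Hive that its long-term development was hard to sustain: for example, it could not integrate well with Spark's other components, and it could not meet Spark's goal of a one-stack solution for big data processing.

2. SparkSQL

1. Introduction to SparkSQL

Hive is the predecessor of Shark, and Shark is the predecessor of SparkSQL. The fundamental reason SparkSQL came about is that it breaks completely free of Hive's limitations.

SparkSQL supports querying native RDDs. The RDD is the core concept of the Spark platform and the basis on which Spark handles the various big data scenarios efficiently.
SQL statements can be written in Scala. Simple SQL syntax checking is supported, and Hive statements can be written in Scala to access Hive data, with the results returned for use as an RDD.

2. Spark on Hive and Hive on Spark

Spark on Hive: Hive serves only as the storage layer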

How to write custom function from percentile_approx code which gives as equal result as percentile.inc in excel?


Question

I am using spark-sql-2.4.1v with Java 8. I need to calculate percentiles such as 25, 75, and 90 for some given data. I tried using percentile_approx() from Spark SQL to do this, but the results of percentile_approx() do not match the fractional percentiles of an Excel sheet that uses PERCENTILE.INC(). Hence, I'm wondering how to fix or adjust the percentile_approx() function. Is there any way to override it, or to write a custom function modifying percentile_approx(), which calculates fractional
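For reference, PERCENTILE.INC uses inclusive linear interpolation: for n sorted values and a fraction k in [0, 1], it takes the zero-based position k * (n - 1) and interpolates between the two neighbouring values. The sketch below shows that calculation in plain Python on an in-memory list. The function name `percentile_inc` is hypothetical, this is not a modification of percentile_approx(), and it would still need to be ported to Java and wired into Spark (for instance as a user-defined function) to answer the question as asked.

```python
import math


def percentile_inc(values, k):
    """Inclusive percentile with linear interpolation, as in PERCENTILE.INC.

    values: non-empty iterable of numbers.
    k: fraction between 0 and 1 inclusive (0.25 for the 25th percentile).
    """
    data = sorted(values)
    if not data:
        raise ValueError("values must not be empty")
    if not 0 <= k <= 1:
        raise ValueError("k must be between 0 and 1")
    position = k * (len(data) - 1)   # zero-based fractional rank
    lower = math.floor(position)
    upper = math.ceil(position)
    fraction = position - lower
    # Interpolate between the two values that surround the rank.
    return data[lower] + fraction * (data[upper] - data[lower])


if __name__ == "__main__":
    sample = [1, 2, 3, 4]
    for k in (0.25, 0.75, 0.9):
        print(k, percentile_inc(sample, k))
```

For the sample [1, 2, 3, 4], the 25th percentile sits at position 0.75, which gives 1 + 0.75 * (2 - 1) = 1.75. An approximate percentile function can return a different number here, because it does not necessarily interpolate between neighbouring values in this way.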

[Spark] Spark Streaming: performance tuning


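Spark Streaming: performance tuning

Spark Master at spark://node-01:7077
sparkstreaming thread count (Baidu search)
Asynchronous optimization with a thread pool inside partitions in streaming - 曾晓森's blog - CSDN Blog
Lesson 116: Spark Streaming performance optimization: how to handle, within milliseconds, programs with high throughput and large data fluctuations - CSDN Blog
Spark (12): performance tuning - 蒋源德 - 博客园 (cnblogs)
Repost: improving Spark execution efficiency by setting spark.default.parallelism sensibly - Feeling - BlogJava
Improving Spark execution efficiency by setting spark.default.parallelism sensibly - CSDN Blog
Spark performance tuning
Spark run modes (1): Spark standalone mode - CSDN Blog
spark spark.executor.cores multiple threads (Baidu search)
Understanding Spark parallelism, part 1 - lee's personal space
spark executor - zyc920716's blog - CSDN Blog
Specifying Spark parameters: num-executor, executor-memory, executor-cores - 新际航 - 博客园 (cnblogs)
A hands-on guide to Spark performance tuning - ImportNew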
