Hive

Hive RegexSerDe Multiline Log matching

陌路散爱 submitted on 2021-02-07 05:52:33
Question: I am looking for a regex that can be fed to a "create external table" statement in Hive QL in the form "input.regex" = "the regex goes here". The condition is that the logs in the files that the RegexSerDe must read have the following form:

2013-02-12 12:03:22,323 [DEBUG] 2636hd3e-432g-dfg3-dwq3-y4dsfq3ew91b Some message that can contain any special character, including linebreaks. This one does not have a linebreak. It just has spaces on the same line.
2013-02-12 12:03:24,527 [DEBUG]
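
For reference, a minimal sketch of such a table definition (the table name, column names, and the regex are my assumptions, not from the question). Note that with the default TextInputFormat, Hive splits records on newlines before the SerDe ever sees them, so the regex alone matches one physical line; stitching a multiline message back together needs a custom input format:

    CREATE EXTERNAL TABLE app_log (
      log_ts STRING,
      level  STRING,
      msg    STRING
    )
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
    WITH SERDEPROPERTIES (
      -- one capture group per column; matches a single physical line
      "input.regex" = "^(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2},\\d{3}) \\[(\\w+)\\] (.*)$"
    )
    LOCATION '/logs/app';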

Why do we need to move an external table to a managed Hive table?

假如想象 submitted on 2021-02-07 03:43:30
Question: I am new to Hadoop and learning Hive. In Hadoop: The Definitive Guide, 3rd edition, page 428, last paragraph, I don't understand the following passage about external tables in Hive: "A common pattern is to use an external table to access an initial dataset stored in HDFS (created by another process), then use a Hive transform to move the data into a managed Hive table." Can anybody briefly explain what the above phrase says? Answer 1: Usually the data in the initial dataset is not constructed in the optimal
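
A minimal sketch of the pattern the book describes (table and path names are made up for illustration): the external table merely points at data that another process dropped into HDFS, and the INSERT copies it into a managed table whose storage Hive owns and can optimize:

    CREATE EXTERNAL TABLE raw_events (line STRING)
    LOCATION '/data/incoming/events';   -- data written by another process

    CREATE TABLE events_managed (line STRING)
    STORED AS ORC;                      -- Hive owns and optimizes this copy

    INSERT INTO TABLE events_managed
    SELECT line FROM raw_events;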

auxService:mapreduce_shuffle does not exist on hive

百般思念 submitted on 2021-02-07 03:06:27
Question: I am using Hive 1.2.0 and Hadoop 2.6.0. Whenever I run Hive on my machine, SELECT queries work fine, but count(*) fails with the following error: Diagnostic Messages for this Task: Container launch failed for container_1434646588807_0001_01_000005 : org.apache.hadoop.yarn.exceptions.InvalidAuxServiceException: The auxService:mapreduce_shuffle does not exist at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl
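
This error usually means the YARN NodeManagers are not configured with the MapReduce shuffle auxiliary service (a plain SELECT needs no MapReduce job, while count(*) does, which is why only the latter fails). A sketch of the commonly cited fix, added to yarn-site.xml on every node, followed by a YARN restart:

    <!-- register the MapReduce shuffle service with each NodeManager -->
    <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
    </property>
    <property>
      <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
      <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>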

A Docker-Based Big Data Development Environment

折月煮酒 submitted on 2021-02-06 20:37:25
Big data development depends heavily on the runtime environment and on data. Developing a Spark application, for example, often depends on Hive, but a local development environment has no Hive, so code has to be copied back and forth between the local machine and a server, which is inefficient. I believe that building a single-node big data cluster locally with Docker and then copying the code into the container for testing can improve this. I have explored this idea myself: https://github.com/iamabug/BigDataParty. That image installs Hadoop, Hive, Spark, and other components and basically meets the need, but some problems remain; for example, the configuration sometimes has to be adjusted to stay consistent with production, which is doable but takes real effort. In fact, both CDH and HDP provide similar single-node images. The component versions in HDP are fairly new and match my company's technology stack, so I will try it out here; if the experience is better, I will use it for related development from now on.

CDH image: https://hub.docker.com/r/cloudera/quickstart/
HDP image: https://www.cloudera.com/tutorials/sandbox-deployment-and-install-guide/3.html

Getting the sandbox

System requirements: install Docker 17.09 or newer; on Windows and Mac, Docker must be configured with more than 10 GB of memory.

Script download and execution: in a browser, visit https://www
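
As a hedged illustration of the CDH route mentioned above (flags recalled from the cloudera/quickstart Docker Hub page; verify against the current documentation before relying on them):

    # start the single-node CDH sandbox; 8888 exposes the Hue web UI
    docker run --hostname=quickstart.cloudera --privileged=true -t -i \
      -p 8888:8888 \
      cloudera/quickstart /usr/bin/docker-quickstart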

Hive GROUP BY Explained in Detail

人盡茶涼 submitted on 2021-02-06 12:38:59
1. Understanding how GROUP BY executes

First look at Table 1, a table named test (shown as an image in the original post). Execute the following SQL statement:

SELECT name FROM test GROUP BY name;

You can easily predict the result: it is Table 2 (also an image in the original). But to better understand grouping by multiple columns and aggregate functions, I suggest imagining a virtual intermediate table, Table 3, between Table 1 and Table 2. Here is how to think through the execution of the statement above:

1. FROM test: after this step the result is the same as Table 1, the original table.
2. FROM test GROUP BY name: after this step, imagine that virtual Table 3 is generated. It is produced like this: for GROUP BY name, take the name column and merge rows with the same name value into one row. For the name value aa, for example, the rows <1 aa 2> and <2 aa 3> merge into one row, with all of their id values and number values written into a single cell.
3. The SELECT is then evaluated against virtual Table 3:
(1) If you run SELECT *, the result would be virtual Table 3 itself, but some cells in the id and number columns hold multiple values, and a relational database is based on relations in which a cell may not hold more than one value, so SELECT * fails with an error.
(2) Looking at the name column, every cell holds exactly one value, so if we SELECT name
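
A hedged way to make the "virtual Table 3" idea concrete in Hive (assuming test has the columns id, name, and number described above): collect_list packs each group's values into a single cell, which is exactly what a bare SELECT id cannot legally return:

    SELECT name,
           collect_list(id)     AS ids,      -- every id in the group, in one cell
           collect_list(number) AS numbers   -- every number in the group, in one cell
    FROM test
    GROUP BY name;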

Analyzing Hive Log Files with SQL Statements

隐身守侯 submitted on 2021-02-06 12:38:46
IP access frequency. Create the log table in Hive:
hive> create table KINDWEB_access_log_5_20(ip string, s1 string, s2 string, date string, way string, url string, w1 string, w2 string, w3 string) row format delimited fields terminated by ' ';
Load the log data into Hive:
hive> load data inpath '/testdata/accessdata/KINDWEB/201505/KINDWEB_access_log.2015-05-20.txt' overwrite into table KINDWEB_access_log_5_20;
Inspect the table structure:
hive> desc KINDWEB_access_log_5_20;
View the table data:
hive> select * from KINDWEB_access_log_5_20;
View the first 10 rows:
hive> select * from KINDWEB_access_log_5_20 limit 10;
SELECT query for IP access frequency:
hive> select ip,count(*) from kindweb_access_log_5_20
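
The final statement above is cut off in the source. A plausible completion (my assumption, not the original author's text) that counts requests per IP and ranks the busiest addresses:

    select ip, count(*) as cnt          -- 'cnt' alias is assumed
    from kindweb_access_log_5_20
    group by ip                         -- count(*) with a non-aggregated column requires GROUP BY
    order by cnt desc
    limit 10;                           -- top 10 IPs by request count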

Hive vs. MySQL: max, GROUP BY, and Log Analysis

落爺英雄遲暮 submitted on 2021-02-06 10:52:06
Preparation

MySQL model: test_max_date(id int, name varchar(255), num int, date date)
Hive model:
create table test_date_max(id int, name string, rq Date);
insert into table test_date_max values (1,"1","2020-12-25"), (2,"1","2020-12-28"), (3,"2","2020-12-25"), (4,"2","2020-12-20");

Requirement: query each person's latest status.

Logic: each person has multiple rows, and a larger date means a newer status.

Queries

MySQL: SELECT id,name,date,max(date) from test_max_date group by name ORDER BY id
Hive: select name,max(rq) from test_date_max group by name;

About the error message: Hive's GROUP BY restriction was discussed in an earlier post. The Hive table here has id, name, and a date; id is a primary key and never repeats, while name can repeat. Grouping by name and taking max(rq) effectively deduplicates name and returns, for each group of identical name values, the largest date. It is like a company split into several departments: the departments are fixed
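
To retrieve each person's entire latest row in Hive (not just the maximum date, which is where the MySQL-style query of non-grouped columns breaks), one hedged option is a window function over the test_date_max table defined above:

    select id, name, rq
    from (
      select id, name, rq,
             row_number() over (partition by name order by rq desc) as rn  -- rank rows per name, newest first
      from test_date_max
    ) t
    where rn = 1;  -- keep only the newest row for each name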