Hive

regexp_extract in Hive giving error

流过昼夜 · Submitted on 2021-02-05 06:39:25
Question: I have some data in a table, e.g.:

id,params
123,utm_content=doit|utm_source=direct|
234,utm_content=polo|utm_source=AndroidNew|

Desired output using regexp_extract:

id,channel,content
123,direct,doit
234,AndroidNew,polo

Query used:

select id,
       regexp_extract(lower(params), '(.*utm_source=)([^\|]*)(\|*)', 2) as channel,
       regexp_extract(lower(params), '(.*utm_content=)([^\|]*)(\|*)', 2) as content
from table;

It fails with the error "dangling meta character '*'" and returns error code 2. Can someone help here?
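A likely cause, sketched below as an assumption rather than a confirmed diagnosis: Hive single-quoted strings consume one level of backslash escaping before handing the pattern to Java's regex engine, so '\|' arrives as a bare '|', and the group '(\|*)' degenerates to '(|*)' — a '*' with nothing to repeat, which Java rejects as a dangling meta character. Doubling the backslashes ('\\|') should deliver the intended '\|' to the engine. A minimal Java demonstration of both pattern shapes (the sample value is taken from the question):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class UtmExtract {
    public static void main(String[] args) {
        // What the engine sees after Hive strips the single backslash:
        // "(|*)" puts '*' right after '|', which Java refuses to compile.
        String bad = "(.*utm_source=)([^|]*)(|*)";
        try {
            Pattern.compile(bad);
        } catch (PatternSyntaxException e) {
            System.out.println("rejected: " + e.getDescription());
        }

        // With '\\|' in the Hive literal, the engine receives \| as intended.
        String good = "(.*utm_source=)([^\\|]*)(\\|*)";
        Matcher m = Pattern.compile(good)
                .matcher("utm_content=doit|utm_source=direct|");
        if (m.matches()) {
            System.out.println("channel=" + m.group(2)); // prints channel=direct
        }
    }
}
```

In Hive itself the corresponding fix would be writing the pattern as '(.*utm_source=)([^\\|]*)(\\|*)' inside the query.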

Building a New-Generation Cloud-Native Data Lake with Iceberg on Kubernetes

坚强是说给别人听的谎言 · Submitted on 2021-02-05 03:01:07
Author: Xu Bei, container expert engineer at Tencent Cloud, with 10 years of R&D experience and 7 years in cloud computing, responsible for Tencent Cloud TKE big-data cloud-native work, online-offline co-location, and Serverless architecture and development. Background: Counting from Google's 2003 publication of the first paper, "The Google File System," big data has now been around for 17 years. Unfortunately, Google did not open-source its technology at the time and "merely" published three technical papers, so in retrospect they only lifted the curtain on the big-data era. With the birth of Hadoop, big data entered a period of rapid growth, and its dividends and commercial value have been continuously unlocked. Today, big-data storage and processing needs are increasingly diverse. In the post-Hadoop era, building a unified data lake storage layer that supports multiple forms of data analysis on top of it has become an important direction for enterprises building a big-data ecosystem, and building Data Pipelines on data lake storage quickly, consistently, and atomically has become a pressing problem. Moreover, with the arrival of the cloud-native era, the automated deployment and delivery capabilities inherent to cloud native are catalyzing this process. This article introduces how to use Iceberg [1] with Kubernetes to build a new generation of cloud-native data lake. What is Iceberg? Apache Iceberg is an open table format for huge analytic datasets. Iceberg adds tables to Presto and Spark that use a

Hive number of reducers in group by and count(distinct)

梦想与她 · Submitted on 2021-02-04 21:09:57
Question: I was told that count(distinct) may cause data skew because only one reducer is used. I ran a test on a table with 5 billion rows using two queries.

Query A: select count(distinct columnA) from tableA
Query B: select count(columnA) from (select columnA from tableA group by columnA) a

Query A takes about 1000-1500 seconds, while query B takes 500-900 seconds. The result seems expected. However, I noticed that both queries use 370 mappers and 1 reducer, and they have almost the
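The intuition behind the rewrite can be sketched in a toy Java model of the two shuffle shapes — this is an illustrative simplification, not Hive's actual execution plan: a plain count(distinct) funnels every value to one reducer, while the group-by rewrite lets the first stage hash-partition the keys so each reducer dedupes only its own shard before a cheap final sum.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class CountDistinctShapes {
    // Query A's shape: all values funnel into a single reducer, which
    // must dedupe the entire key space by itself.
    static long singleReducer(List<String> rows) {
        return new HashSet<>(rows).size();
    }

    // Query B's shape: stage 1 hash-partitions keys across reducers so
    // each shard dedupes only its slice; stage 2 sums the shard sizes.
    static long twoStage(List<String> rows, int nReducers) {
        Map<Integer, Set<String>> shards = new HashMap<>();
        for (String v : rows) {
            int shard = Math.floorMod(v.hashCode(), nReducers);
            shards.computeIfAbsent(shard, k -> new HashSet<>()).add(v);
        }
        return shards.values().stream().mapToLong(Set::size).sum();
    }

    public static void main(String[] args) {
        List<String> data = Arrays.asList("a", "b", "a", "c", "b", "d");
        System.out.println(singleReducer(data)); // 4
        System.out.println(twoStage(data, 4));   // 4
    }
}
```

Both shapes give the same answer; the difference is that the per-shard dedupe work in the two-stage version can run in parallel, which is why query B can finish faster even when the final reducer count reported for the job looks the same.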

Dynamic partitioning in Hive through the exact inserted timestamp

回眸只為那壹抹淺笑 · Submitted on 2021-02-04 21:06:34
Question: I need to insert data into a given external table, which should be partitioned by the insertion date. My question is: how does Hive handle timestamp generation? When I select a timestamp for all inserted records like this:

WITH delta_insert AS (
  SELECT trg.*, from_unixtime(unix_timestamp()) AS generic_timestamp
  FROM target_table trg
)
SELECT * FROM delta_insert;

will the timestamp always be identical for all records, even if the query takes a long time to run? Or should I alternatively only
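One relevant detail, stated here with hedging because behavior varies by Hive version: the no-argument unix_timestamp() is evaluated while rows are processed and is treated as non-deterministic (it was deprecated in Hive 2.0 for this reason), whereas current_timestamp is fixed once per query, so every row receives the same value. The contrast can be modeled in a small Java sketch — an analogy, not Hive code:

```java
public class TimestampShapes {
    // Analogous to current_timestamp in Hive: one value captured at
    // query start and reused for every row, so all rows match.
    static long[] perQueryTimestamps(int nRows) {
        long fixed = System.currentTimeMillis();
        long[] out = new long[nRows];
        java.util.Arrays.fill(out, fixed);
        return out;
    }

    // Analogous to no-arg unix_timestamp(): evaluated as each row is
    // processed, so values can drift during a long-running query.
    static long[] perRowTimestamps(int nRows) throws InterruptedException {
        long[] out = new long[nRows];
        for (int i = 0; i < nRows; i++) {
            out[i] = System.currentTimeMillis();
            Thread.sleep(5); // stand-in for per-row processing time
        }
        return out;
    }

    public static void main(String[] args) throws InterruptedException {
        long[] perQuery = perQueryTimestamps(3);
        System.out.println(perQuery[0] == perQuery[2]); // true: identical
        long[] perRow = perRowTimestamps(3);
        System.out.println(perRow[0] == perRow[2]);     // may be false
    }
}
```

If identical partition values for all rows of one insert are the requirement, current_timestamp is the safer choice than from_unixtime(unix_timestamp()).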

A Golden Season

夙愿已清 · Submitted on 2021-02-04 04:26:14
In this golden harvest season, following nomination and a vote by the Apache DolphinScheduler PPMC, Apache DolphinScheduler has gained 5 new Committers: nauu (Zhu Kai), Rubik-W (Wen Hemin), gabrywu, liwenhe1993, and clay4444. On becoming Committers, they had this to say. Zhu Kai: "I am truly honored to become a DolphinScheduler Committer. This is both a joy and a responsibility. I will keep the end goal in mind, continue leveling up, and help DS graduate soon." Wen Hemin: "I am honored to join the DS Committer team. I learned about DS through a technology evaluation and ultimately chose to adopt it, and the community's efficient support helped the project land smoothly. DS is the first open-source project I have participated in; I have benefited greatly from open source and want to give back what I can. I hope to contribute even more to DS in the future, and I wish DS a smooth graduation." About the community: Apache DolphinScheduler is a very diverse community, with nearly 100 contributors to date from more than 30 different companies, and 3,000 users in its WeChat groups. Some Apache DolphinScheduler user cases (in no particular order): more than 300 companies and research institutions already use DolphinScheduler to handle all kinds of scheduling and timed tasks, and nearly 500 more companies have started trials of DolphinScheduler: Apache