Hive

partitions in hive interview questions

Submitted by 六月ゝ 毕业季﹏ on 2020-07-05 11:08:51
Question: 1) If the partitioned column has no data, what error will you get when you query on it? 2) If some rows don't have the partitioned column, how will those rows be handled? Will there be any data loss? 3) Why does bucketing need to be done on a numeric column? Can we use a string column as well? What is the process, and on what basis would you choose the bucketing column? 4) Will the internal table details also be stored in the metastore, or only external table details?

Does Spark SQL use Hive Metastore?

Submitted by 蓝咒 on 2020-07-04 07:59:10
Question: I am developing a Spark SQL application and I've got a few questions: I read that Spark SQL uses the Hive metastore under the covers? Is this true? I'm talking about a pure Spark SQL application that does not explicitly connect to any Hive installation. I am starting a Spark SQL application and have no need to use Hive. Is there any reason to use Hive? From what I understand, Spark SQL is much faster than Hive, so I don't see any reason to use it. But am I correct? Answer 1: I read that Spark-SQL uses

generate where clause in bash using variables

Submitted by 大城市里の小女人 on 2020-07-03 15:54:53
Question: I have some variables in bash like below: min_date='2020-06-06', max_date='2020-06-08', max_seq_min_date=1 (this is up to where data is processed), max_seq_max_date=3 (this is up to where data has to be processed). batch_date is the column where I will use the min_date and max_date values. For each date there will be 4 sequences. I want to generate a WHERE clause that I can use in a SQL query, looking like the one below: WHERE 1=1 or (batch_date = '2020-06-06' and seq_num in (
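
The question's example clause is cut off, so the exact per-date sequence ranges below are an assumption: the first date resumes after max_seq_min_date, the last date stops at max_seq_max_date, and the dates in between take all 4 sequences. A minimal sketch (GNU date and seq assumed):

```shell
#!/bin/sh
# Variables from the question.
min_date='2020-06-06'
max_date='2020-06-08'
max_seq_min_date=1   # data processed up to this sequence on min_date
max_seq_max_date=3   # data to be processed up to this sequence on max_date

clause="WHERE 1=1"
d="$min_date"
while [ "$(date -d "$d" +%s)" -le "$(date -d "$max_date" +%s)" ]; do
  if [ "$d" = "$min_date" ]; then
    lo=$((max_seq_min_date + 1)); hi=4      # resume after the processed seq
  elif [ "$d" = "$max_date" ]; then
    lo=1; hi=$max_seq_max_date              # stop at the target seq
  else
    lo=1; hi=4                              # full day in between
  fi
  seqs=$(seq -s, "$lo" "$hi")
  clause="$clause or (batch_date = '$d' and seq_num in ($seqs))"
  d=$(date -d "$d + 1 day" +%F)
done
echo "$clause"
```

The generated clause can then be interpolated into the SQL query string, e.g. `query="SELECT * FROM t $clause"`.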

Updating column values based on the other table values in hive tables

Submitted by 纵饮孤独 on 2020-06-29 04:04:17
Question: I have two tables like below in Hive. stg: this table is basically a snapshot table which will be overwritten every day; its data will be inserted into a history table every day in a new partition. Day 1 stg table:

+-----+------------+------------+
| pk  | from_d     | to_d       |
+-----+------------+------------+
| 111 | 2019-01-01 | 2019-01-01 |
| 222 | 2019-01-01 | 2019-01-01 |
| 333 | 2019-01-01 | 2019-01-01 |
+-----+------------+------------+
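
The excerpt is cut off before the expected merge logic, so as a hedged sketch only, the daily snapshot-to-history load the question describes could be built as a statement string like this (the table names come from the question; the partition column name and value are assumptions):

```shell
#!/bin/sh
# Build the daily "snapshot -> history" statement as a string; hive/beeline
# would be needed to actually run it, so this only prints the SQL.
load_date='2019-01-01'   # hypothetical partition value for Day 1
sql="INSERT INTO TABLE history PARTITION (load_date='${load_date}')
SELECT pk, from_d, to_d FROM stg"
echo "$sql"
```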

Is it okay to use UUID as Surrogate Key for a datawarehouse in hive?

Submitted by 我们两清 on 2020-06-28 02:02:20
Question: For implementing surrogate keys in our Hive data warehouse I have narrowed it down to 2 options: 1) reflect('java.util.UUID','randomUUID') 2) INPUT__FILE__NAME + BLOCK__OFFSET__INSIDE__FILE. Which of the above two is the better option to go with? Or would you suggest an even better one? Thank you. Answer 1: For ORC and sequence files, BLOCK__OFFSET__INSIDE__FILE is not unique per file, and the official documentation says that it is the current block's first byte's file offset. In some resources on the Internet it
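
For a quick local feel for option (1): the same kind of random (version 4) UUID that reflect('java.util.UUID','randomUUID') returns can be generated outside Hive. Both the uuidgen tool (util-linux) and the Linux /proc fallback used here are assumptions about the environment:

```shell
#!/bin/sh
# Generate one random UUID, as a local stand-in for Hive's
# reflect('java.util.UUID','randomUUID').
if command -v uuidgen >/dev/null 2>&1; then
  id=$(uuidgen)
else
  id=$(cat /proc/sys/kernel/random/uuid)
fi
echo "$id"   # 36 characters: 8-4-4-4-12 hex groups
```

Collision probability for random UUIDs is negligible at warehouse scale, which is why they are a common surrogate-key choice when no natural key exists.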

Hive query: select a column based on the condition that another column's values match some specific values, then create the match result as a new column

Submitted by 不问归期 on 2020-06-27 18:37:06
Question: I have to do some query-and-create-column operations in HiveQL. For example:

app   col1
app1  anybody love me?
app2  I hate u
app3  this hat is good
app4  I don't like this one
app5  oh my god
app6  damn you.
app7  such nice girl
app8  xxxxx
app9  pretty prefect
app10 don't love me.
app11 xxx anybody?

I want to match a keyword list like ['anybody', 'love', 'you', 'xxx', 'don't'] and select the matched keyword result as a new column, named keyword, as follows:

app   keyword
app1  anybody, love
app4  I don't like
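
In Hive itself this kind of matching is typically done with regexp_extract or a regex-based UDF; as a hedged sketch of just the matching logic, here is the same keyword extraction done with grep (keyword list and sample rows taken from the question):

```shell
#!/bin/sh
# Extract every keyword that appears in a row and join the hits with commas,
# mimicking the desired "keyword" column.
pattern="anybody|love|you|xxx|don't"
for line in "anybody love me?" "I don't like this one"; do
  matched=$(printf '%s\n' "$line" | grep -oE "$pattern" | paste -sd, -)
  printf '%s -> %s\n' "$line" "$matched"
done
```

For the first sample row this yields "anybody,love" as the matched keywords; rows with no hits would get an empty keyword column.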

How to enable or disable Hive support in spark-shell through Spark property (Spark 1.6)?

Submitted by 孤街浪徒 on 2020-06-25 18:11:28
Question: Is there any configuration property we can set to explicitly disable/enable Hive support through spark-shell in Spark 1.6? I tried to get all the sqlContext configuration properties with sqlContext.getAllConfs.foreach(println), but I am not sure which property is actually required to disable/enable Hive support. Or is there any other way to do this? Answer 1: Spark >= 2.0: enabling and disabling of the Hive context is possible with the config spark.sql.catalogImplementation. Possible values for spark
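
Per the answer, on Spark >= 2.0 Hive support hangs off spark.sql.catalogImplementation ("hive" vs "in-memory"). The commands are built as strings below so the example does not require a Spark installation; the --conf usage is otherwise standard spark-shell syntax:

```shell
#!/bin/sh
# Launch commands for spark-shell with Hive support on and off
# (printed rather than executed).
with_hive="spark-shell --conf spark.sql.catalogImplementation=hive"
without_hive="spark-shell --conf spark.sql.catalogImplementation=in-memory"
echo "$without_hive"
```

Note that the question asks about Spark 1.6, while this property only applies from Spark 2.0 onward, as the (truncated) answer indicates.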