Hive

Hive: Cleaner way to SELECT AS and GROUP BY

南笙酒味 posted on 2020-05-25 06:46:25
Question: I am trying to write Hive SQL like this: SELECT count(1), substr(date, 1, 4) as year FROM *** GROUP BY year. But Hive cannot recognize the alias 'year' and complains: FAILED: SemanticException [Error 10004]: Line 1:79 Invalid table alias or column reference 'year'. One solution (Hive: SELECT AS and GROUP BY) suggests using 'GROUP BY substr(date, 1, 4)'. It works! However, in some cases the value I want to group by is generated from multiple lines of Hive function code, and it is very ugly to …
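
The excerpt is cut off, but a common workaround (not stated in the snippet) is to move the long expression into a subquery or CTE, so the outer query can legally GROUP BY the alias. Below is a minimal Spark-on-Hive sketch; the table name some_db.some_table is made up for illustration.

    import org.apache.spark.sql.SparkSession

    object GroupByAliasSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("group-by-alias")
          .enableHiveSupport()
          .getOrCreate()

        // The derived column is computed once in the inner query, so the outer query
        // may reference the alias in GROUP BY, however long the expression gets.
        val byYear = spark.sql(
          """
            |SELECT year, count(1) AS cnt
            |FROM (
            |  SELECT substr(`date`, 1, 4) AS year   -- the multi-line expression lives here
            |  FROM some_db.some_table               -- hypothetical table name
            |) t
            |GROUP BY year
          """.stripMargin)

        byYear.show()
        spark.stop()
      }
    }

The same subquery form also works from the Hive CLI or beeline; depending on the Hive version, positional GROUP BY (e.g. GROUP BY 2) may be another option when the corresponding position-alias setting is enabled.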

Assign same value when using lag function if column used in lag has same value

岁酱吖の posted on 2020-05-24 03:57:27
Question: I have a table in SQL whose contents are below:
+---+----------+----------+----------+--------+
| pk|    from_d|      to_d| load_date| row_num|
+---+----------+----------+----------+--------+
|111|2019-03-03|2019-03-03|2019-03-03|       1|
|111|2019-02-02|2019-02-02|2019-02-02|       2|
|111|2019-02-02|2019-02-02|2019-02-02|       2|
|111|2019-01-01|2019-01-01|2019-01-01|       3|
|222|2019-03-03|2019-03-03|2019-03-03|       1|
|222|2019-01-01|2019-01-01|2019-01-01|       2|
|333|2019-02-02|2019-02-02|2019-02-02|       1|
|333|2019-01-01|2019-01-01 …
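
The question text is truncated, but when duplicate rows must all receive the same lag() result, one common pattern is to compute the window function over the distinct rows and join it back. A hedged Spark SQL sketch, assuming a hypothetical my_table with the columns shown above:

    import org.apache.spark.sql.SparkSession

    object LagOverDistinctSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("lag-sketch")
          .enableHiveSupport()
          .getOrCreate()

        // Compute lag() once per distinct (pk, row_num) group, then join the result back,
        // so duplicate rows in the same group receive the same lagged value.
        val result = spark.sql(
          """
            |WITH distinct_rows AS (
            |  SELECT DISTINCT pk, from_d, to_d, row_num
            |  FROM my_table                          -- hypothetical table name
            |),
            |lagged AS (
            |  SELECT pk, row_num,
            |         lag(to_d) OVER (PARTITION BY pk ORDER BY row_num) AS prev_to_d
            |  FROM distinct_rows
            |)
            |SELECT t.*, l.prev_to_d
            |FROM my_table t
            |JOIN lagged l
            |  ON t.pk = l.pk AND t.row_num = l.row_num
          """.stripMargin)

        result.show()
        spark.stop()
      }
    }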

In Spark Streaming, is it possible to upsert batch data from Kafka to Hive?

旧城冷巷雨未停 posted on 2020-05-17 08:52:05
Question: My plan is: 1. use Spark Streaming to load data from Kafka every period, e.g. 1 minute; 2. convert the data loaded each minute into a DataFrame; 3. upsert the DataFrame into a Hive table (a table storing all history data). Currently, I have successfully implemented steps 1-2, and I want to know whether there is any practical way to realize step 3. In detail: 1. load the latest history table with a certain partition in Spark Streaming; 2. use the batch DataFrame to join the history table/DataFrame with …
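
Step 3 is not answered in the excerpt. One hedged way to sketch it is Structured Streaming's foreachBatch plus a join-and-rewrite of the history table; every concrete name below (broker, topic, db.history, db.history_staging, the two-column layout) is made up for illustration. With a transactional (ACID) Hive table, issuing a MERGE statement through the Hive Warehouse Connector would be an alternative approach.

    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.apache.spark.sql.streaming.Trigger

    object KafkaToHiveUpsertSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("kafka-to-hive-upsert")
          .enableHiveSupport()
          .getOrCreate()

        // Steps 1-2 from the question: a one-minute micro-batch stream from Kafka.
        val stream = spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")   // hypothetical broker
          .option("subscribe", "events")                       // hypothetical topic
          .load()
          .selectExpr("CAST(key AS STRING) AS id", "CAST(value AS STRING) AS payload")

        // Step 3 (sketch): upsert inside foreachBatch. A plain Hive table cannot be updated
        // row by row, so the micro-batch is merged with the history via a full outer join and
        // the table is rebuilt through a staging table (overwriting a table that is read in
        // the same plan is not allowed, hence the extra hop).
        val query = stream.writeStream
          .trigger(Trigger.ProcessingTime("1 minute"))
          .option("checkpointLocation", "/tmp/chk-kafka-hive")  // hypothetical checkpoint dir
          .foreachBatch { (batch: DataFrame, batchId: Long) =>
            batch.createOrReplaceTempView("batch_updates")
            spark.sql(
              """
                |INSERT OVERWRITE TABLE db.history_staging
                |SELECT COALESCE(b.id, h.id)           AS id,
                |       COALESCE(b.payload, h.payload) AS payload
                |FROM db.history h
                |FULL OUTER JOIN batch_updates b ON h.id = b.id
              """.stripMargin)
            spark.sql("INSERT OVERWRITE TABLE db.history SELECT * FROM db.history_staging")
          }
          .start()

        query.awaitTermination()
      }
    }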

How to resolve com.mongodb.spark.exceptions.MongoTypeConversionException: Cannot cast… Java Spark

雨燕双飞 posted on 2020-05-17 06:31:05
Question: Hi, I am new to Java Spark and have been looking for solutions for a couple of days. I am working on loading MongoDB data into a Hive table; however, I get this error during saveAsTable: com.mongodb.spark.exceptions.MongoTypeConversionException: Cannot cast STRING into a StructType(StructField(oid,StringType,true)) (value: BsonString{value='54d3e8aeda556106feba7fa2'}). I've tried increasing the sampleSize, different mongo-spark-connector versions, ... but none of them worked …
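
This exception typically appears when schema inference samples documents in which a field is sometimes an ObjectId (a struct with an oid member) and sometimes a plain string. One direction that is often suggested, rather than a guaranteed fix, is to supply an explicit schema so inference is bypassed; whether a user-supplied schema is honored depends on the connector version. A Scala sketch (the same options apply from Java); the URI, field names, and target table are invented:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types._

    object MongoExplicitSchemaSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("mongo-to-hive")
          .config("spark.mongodb.input.uri", "mongodb://host:27017/mydb.mycoll")  // hypothetical URI
          .enableHiveSupport()
          .getOrCreate()

        // Declare the conflicting field's type up front instead of relying on sampling.
        val schema = StructType(Seq(
          StructField("_id", StructType(Seq(StructField("oid", StringType, nullable = true))), nullable = true),
          StructField("someField", StringType, nullable = true)   // placeholder field
        ))

        val df = spark.read
          .format("mongo")      // mongo-spark-connector data source
          .schema(schema)
          .load()

        df.write.mode("overwrite").saveAsTable("db.mongo_copy")   // hypothetical Hive table
        spark.stop()
      }
    }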

AWS Athena null values are replaced by N after table is created. How to keep them as they are?

大憨熊 posted on 2020-05-17 06:22:05
Question: I'm creating a table in Athena from CSV data in S3. The data has some quoted columns, so I use: ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' WITH SERDEPROPERTIES ( "separatorChar" = ",", 'serialization.null.format' = ''). The SerDe works fine, but the null values in the resulting table are replaced with N. How can I keep the null values empty (or as NULL, etc.) rather than as N? Thanks. Source: https://stackoverflow.com/questions/61020631/aws-athena-null-values-are-replaced-by-n

SaveAsTable in Spark Scala: HDP3.x

不羁岁月 posted on 2020-05-17 06:08:08
Question: I have one DataFrame in Spark and I'm saving it to Hive as a table, but I get the error message below. java.lang.RuntimeException: com.hortonworks.spark.sql.hive.llap.HiveWarehouseConnector does not allow create table as select. at scala.sys.package$.error(package.scala:27) Can anyone please help me with how I should save this as a table in Hive? val df3 = df1.join(df2, df1("inv_num") === df2("inv_num") // Join both dataframes on id column ).withColumn("finalSalary", when(df1("salary") < df2("salary"), …
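
On HDP 3.x the Hive Warehouse Connector rejects saveAsTable (create-table-as-select); its documented write path goes through the connector's own data source. A sketch under the assumption that the HWC jar is on the classpath; the database, target table, and the stand-in for df3 are hypothetical, and exact class names and options vary across HDP 3.x releases.

    import com.hortonworks.hwc.HiveWarehouseSession
    import com.hortonworks.hwc.HiveWarehouseSession._
    import org.apache.spark.sql.SparkSession

    object HwcWriteSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("hwc-write").getOrCreate()

        // Build an HWC session and write through the connector instead of saveAsTable.
        val hive = HiveWarehouseSession.session(spark).build()
        hive.setDatabase("my_db")                    // hypothetical database

        val df3 = spark.table("some_staging_view")   // stand-in for the joined DataFrame

        df3.write
          .format(HIVE_WAREHOUSE_CONNECTOR)          // constant provided by the connector
          .option("table", "my_hive_table")          // hypothetical target table
          .save()

        spark.stop()
      }
    }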

Creating and using Spark-Hive UDF for Date

坚强是说给别人听的谎言 posted on 2020-05-17 05:54:05
Question: Note: this question is linked from this question: Creating UDF function with NonPrimitive Data Type and using in Spark-sql Query: Scala. Hi, I have created one method in Scala. package test.udf.demo object UDF_Class { def transformDate( dateColumn: String, df: DataFrame) : DataFrame = { val sparksession = SparkSession.builder().appName("App").getOrCreate() val d=df.withColumn("calculatedCol", month(to_date(from_unixtime(unix_timestamp(col(dateColumn), "dd-MM-yyyy"))))) df.withColumn("date1", when …
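
The excerpt's transformDate takes a whole DataFrame, which is why it cannot be registered as a UDF: a UDF operates on individual column values. A hedged sketch of one way to make the same month calculation callable from Spark SQL; the table db.orders and column order_date are invented for the example.

    import org.apache.spark.sql.SparkSession

    object DateUdfSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("date-udf")
          .enableHiveSupport()
          .getOrCreate()

        // Same idea as the question's month(to_date(...)) expression, written as a plain
        // Scala function over a single value and registered for use in SQL queries.
        spark.udf.register("to_month", (s: String) => {
          if (s == null) null.asInstanceOf[java.lang.Integer]
          else java.lang.Integer.valueOf(
            java.time.LocalDate.parse(s,
              java.time.format.DateTimeFormatter.ofPattern("dd-MM-yyyy")).getMonthValue)
        })

        // Hypothetical table and column, only to show the UDF being called from SQL.
        spark.sql("SELECT order_date, to_month(order_date) AS calculatedCol FROM db.orders").show()

        spark.stop()
      }
    }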

Hive table gives error: Unimplemented type

▼魔方 西西 posted on 2020-05-16 22:36:55
Question: Using spark-sql-2.4.1, I am writing a Parquet file whose schema contains |-- avg: double (nullable = true). While reading the same data using val df = spark.read.format("parquet").load(); I get the error: UnsupportedOperationException: Unimplemented type: DoubleType. So what is wrong here, and how can I fix it? Stack trace: Caused by: java.lang.UnsupportedOperationException: Unimplemented type: DoubleType at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readIntBatch …
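
The stack trace shows the vectorized Parquet reader's integer batch path being asked for DoubleType, which usually means the Parquet files physically store an integer type while the table or requested schema says double. A diagnostic sketch (path and table names are placeholders); disabling the vectorized reader is only a sometimes-suggested stopgap, and the durable fix is making the column's type consistent between the files and the table definition.

    import org.apache.spark.sql.SparkSession

    object ParquetDoubleTypeCheckSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("parquet-type-check")
          .enableHiveSupport()
          .getOrCreate()

        // Compare the schema carried by the Parquet footers with the metastore schema.
        val path = "/warehouse/tablespace/managed/hive/db.db/my_table"   // hypothetical location
        spark.read.parquet(path).printSchema()     // schema from the Parquet files
        spark.table("db.my_table").printSchema()   // schema from the Hive metastore

        // Stopgap sometimes suggested while the schemas are reconciled:
        spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
        spark.table("db.my_table").show(5)

        spark.stop()
      }
    }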

Why am I getting negative allocated mappers in Tez job? Vertex failure?

流过昼夜 posted on 2020-05-16 05:13:10
Question: I'm trying to use the PhoenixStorageHandler as documented here, and to populate it with the following query in the beeline shell: insert into table pheonix_table select * from hive_table; I get the following breakdown of the mappers in the Tez session: ... INFO : Map 1: 0(+50)/50 INFO : Map 1: 0(+50)/50 INFO : Map 1: 0(+50,-2)/50 INFO : Map 1: 0(+50,-3)/50 ... before the session crashes with a very long error message (422 lines) about vertex failure: Error: Error while processing statement: FAILED: …