Question
In MySQL, I can have a query like this:
select
cast(from_unixtime(t.time, '%Y-%m-%d %H:00') as datetime) as timeHour
, ...
from
some_table t
group by
timeHour, ...
order by
timeHour, ...
Here, timeHour in the GROUP BY clause refers to the result of a select expression.
But when I tried a similar query in Spark SQL, I got this error:
Error: org.apache.spark.sql.AnalysisException:
cannot resolve '`timeHour`' given input columns: ...
My query for Spark SQL was this:
select
cast(t.unixTime as timestamp) as timeHour
, ...
from
another_table as t
group by
timeHour, ...
order by
timeHour, ...
Is this construct possible in Spark SQL?
Answer 1:
Is this construct possible in Spark SQL?
Yes, it is. There are two ways to make Spark SQL accept the new column in the GROUP BY and ORDER BY clauses.
Approach 1, using a subquery:
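-- The subquery only computes timeHour; the aggregation happens in the outer query,
-- because an aggregate next to non-aggregated columns without GROUP BY is rejected.
SELECT timeHour, sum(...) AS someThing FROM (SELECT
*
, from_unixtime((starttime/1000)) AS timeHour
FROM
some_table) AS t
WHERE
starttime >= 1000*unix_timestamp('2017-09-16 00:00:00')
AND starttime <= 1000*unix_timestamp('2017-09-16 04:00:00')
GROUP BY
timeHour
ORDER BY
timeHour
LIMIT 10;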
Approach 2, using WITH (the more elegant way):
-- Define the CTE; it only computes timeHour
WITH table_alias AS (SELECT
*
, from_unixtime((starttime/1000)) AS timeHour
FROM
some_table)
-- Query the CTE like a table and aggregate here
SELECT timeHour, sum(...) AS someThing FROM table_alias
WHERE
starttime >= 1000*unix_timestamp('2017-09-16 00:00:00')
AND starttime <= 1000*unix_timestamp('2017-09-16 04:00:00')
GROUP BY
timeHour
ORDER BY
timeHour
LIMIT 10;
Alternative, using the Spark DataFrame API (without SQL) in Scala:
// Assumes a SparkSession named spark is in scope (as in spark-shell)
import org.apache.spark.sql.functions._
import spark.implicits._ // enables the $"column" syntax
val df = .... // load the actual table as df
// Add the derived column first, then group and order by it
df.withColumn("timeHour", from_unixtime($"starttime"/1000))
.groupBy($"timeHour")
.agg(sum("...").as("someThing"))
.orderBy($"timeHour")
.show()
// Another way, as suggested in eliasah's comment: alias the expression inside groupBy
df.groupBy(from_unixtime($"starttime"/1000).as("timeHour"))
.agg(sum("...").as("someThing"))
.orderBy($"timeHour")
.show()
Answer 2:
I will try to answer my own question here.
It seems to me that we have to rewrite the query and repeat the computation of the select expression in the GROUP BY clause. For example:
select
from_unixtime((t.starttime/1000)) as timeHour
, sum(...) as someThing
from
some_table as t
where
t.starttime>=1000*unix_timestamp('2017-09-16 00:00:00')
and t.starttime<=1000*unix_timestamp('2017-09-16 04:00:00')
group by
from_unixtime((t.starttime/1000))
order by
from_unixtime((t.starttime/1000))
limit 10;
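The repeated-expression form above and the CTE form from Answer 1 should return the same groups. One way to check that equivalence without a Spark cluster is the rough sketch below, which runs both shapes against SQLite through Python's standard sqlite3 module. Everything in it is an assumption made for illustration: the sample rows, the value column, and the use of SQLite's strftime in place of from_unixtime. SQLite is not Spark (it would also accept the alias directly in GROUP BY), so this only compares the two rewrites and does not reproduce the Spark error.

```python
import sqlite3

# Hypothetical sample data: (starttime in epoch milliseconds, value).
ROWS = [
    (1505520000000, 1.0),  # 2017-09-16 00:00:00 UTC
    (1505521800000, 2.0),  # 2017-09-16 00:30:00 UTC
    (1505523600000, 4.0),  # 2017-09-16 01:00:00 UTC
    (1505530800000, 8.0),  # 2017-09-16 03:00:00 UTC
]

# SQLite stand-in for the hour bucket: format the epoch seconds
# (milliseconds / 1000) as 'YYYY-MM-DD HH:00' in UTC.
HOUR_EXPR = "strftime('%Y-%m-%d %H:00', starttime / 1000, 'unixepoch')"


def hourly_sums():
    """Run both query shapes on an in-memory table and return both results."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE some_table (starttime INTEGER, value REAL)")
    conn.executemany("INSERT INTO some_table VALUES (?, ?)", ROWS)

    # Shape 1: repeat the select expression in GROUP BY and ORDER BY.
    repeated = conn.execute(
        f"""
        SELECT {HOUR_EXPR} AS timeHour, sum(value) AS someThing
        FROM some_table
        GROUP BY {HOUR_EXPR}
        ORDER BY {HOUR_EXPR}
        """
    ).fetchall()

    # Shape 2: compute the expression once in a CTE, then group by its alias.
    derived = conn.execute(
        f"""
        WITH bucketed AS (
            SELECT {HOUR_EXPR} AS timeHour, value
            FROM some_table
        )
        SELECT timeHour, sum(value) AS someThing
        FROM bucketed
        GROUP BY timeHour
        ORDER BY timeHour
        """
    ).fetchall()

    conn.close()
    return repeated, derived


if __name__ == "__main__":
    repeated, derived = hourly_sums()
    print(repeated)
    print(derived)
```

Both lists should contain one row per hour bucket with the summed values. In the CTE shape the expression is written only once, so the select list and the grouping key cannot drift apart when the expression is edited later.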
Source: https://stackoverflow.com/questions/46395333/reuse-the-result-of-a-select-expression-in-the-group-by-clause