Is it possible to alias columns programmatically in spark sql?

て烟熏妆下的殇ゞ 提交于 2019-12-18 03:10:50

问题


In spark SQL (perhaps only HiveQL) one can do:

select sex, avg(age) as avg_age
from humans
group by sex

which would result in a DataFrame with columns named "sex" and "avg_age".

How can avg(age) be aliased to "avg_age" without using textual SQL?

Edit: After zero323 's answer, I need to add the constraint that:

The column-to-be-renamed's name may not be known/guaranteed or even addressable. In textual SQL, using "select EXPR as NAME" removes the requirement to have an intermediate name for EXPR. This is also the case in the example above, where "avg(age)" could get a variety of auto-generated names (which also vary among spark releases and sql-context backends).


回答1:


Let's suppose human_df is the DataFrame for humans. Since Spark 1.3:

human_df.groupBy("sex").agg(avg("age").alias("avg_age"))



回答2:


If you prefer to rename a single column it is possible to use withColumnRenamed method:

case class Person(name: String, age: Int)

val df = sqlContext.createDataFrame(
    Person("Alice", 2) :: Person("Bob", 5) :: Nil) 
df.withColumnRenamed("name", "first_name")

Alternatively you can use alias method:

import org.apache.spark.sql.functions.avg

df.select(avg($"age").alias("average_age")) 

You can take it further with small helper:

import org.apache.spark.sql.Column

def normalizeName(c: Column) = {
  val pattern = "\\W+".r
  c.alias(pattern.replaceAllIn(c.toString, "_"))
}

df.select(normalizeName(avg($"age")))



回答3:


Turns out def toDF(colNames: String*): DataFrame does exactly that. Pasting from 2.11.7 documentation:

def toDF(colNames: String*): DataFrame

Returns a new DataFrame with columns renamed. This can be quite
convenient in conversion from a RDD of tuples into a DataFrame
with meaningful names. For example:

    val rdd: RDD[(Int, String)] = ...
    rdd.toDF()  // this implicit conversion creates a DataFrame
                // with column name _1 and _2
    rdd.toDF("id", "name")  // this creates a DataFrame with
                            // column name "id" and "name"



回答4:


Anonymous columns, such as the one that would be generated by avg(age) without AS avg_age, get automatically assigned names. As you point out in your question, the names are implementation-specific, generated by a naming strategy. If needed, you could write code that sniffs the environment and instantiates an appropriate discovery & renaming strategy based on the specific naming strategy. There are not many of them.

In Spark 1.4.1 with HiveContext, the format is "_cN" where N is the position of the anonymous column in the table. In your case, the name would be _c1.



来源:https://stackoverflow.com/questions/31538624/is-it-possible-to-alias-columns-programmatically-in-spark-sql

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!