scala - how to substring column names after the last dot?

Posted by 倖福魔咒の on 2021-02-08 11:27:34

Question


After exploding a nested structure I have a DataFrame with column names like this:

sales_data.metric1
sales_data.type.metric2
sales_data.type3.metric3

When performing a select I'm getting the error:

cannot resolve 'sales_data.metric1' given input columns: [sales_data.metric1, sales_data.type.metric2, sales_data.type3.metric3]

How should I select from the DataFrame so the column names are parsed correctly?

I've tried the following. The substrings after the last dot are extracted successfully, but I also have columns without dots, such as date, and their names are removed completely (they become empty strings).

var salesDf_new = salesDf
for (col <- salesDf.columns) {
  salesDf_new = salesDf_new.withColumnRenamed(col, StringUtils.substringAfterLast(col, "."))
}

I want to keep just metric1, metric2, and metric3.


Answer 1:


You can use backticks to select columns whose names include periods.

import spark.implicits._  // needed for toDF; assumes a SparkSession named spark

val df = (1 to 1000).toDF("column.a.b")

df.printSchema
// root
//  |-- column.a.b: integer (nullable = false)

df.select("`column.a.b`")
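Applied to the question's DataFrame, the backtick approach might look like the following minimal sketch. It assumes a local SparkSession, made-up sample values, and column names that contain no backticks themselves. The object name BacktickSelect and the helpers quoted and lastComponent are hypothetical names introduced only for this illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object BacktickSelect {
  // Wrap a name in backticks so Spark reads the dots as part of the name,
  // not as struct field access. Assumes the name contains no backticks.
  def quoted(name: String): String = s"`$name`"

  // Text after the last dot; names without a dot come back unchanged.
  def lastComponent(name: String): String = name.split('.').last

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("backtick-select")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample row using column names like those in the question.
    val salesDf = Seq((1, 2, "2021-02-08"))
      .toDF("sales_data.metric1", "sales_data.type.metric2", "date")

    // Select every column by its quoted name and alias it to its last component.
    val renamed = salesDf.select(
      salesDf.columns.map(c => col(quoted(c)).as(lastComponent(c))): _*
    )
    renamed.printSchema()

    spark.stop()
  }
}
```

This renames everything in a single select. If two columns share the same last component, the result has duplicate column names, so it is worth checking for that before relying on the output.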

You can also rename the columns. Starting from your current DataFrame, fold over the column names, renaming one column at each step, and return the final result. This example replaces every dot with an underscore:

val df2 = df.columns.foldLeft(df)(
    (myDF, col) => myDF.withColumnRenamed(col, col.replace(".", "_"))
)

EDIT: Get the last component

To rename each column to just its last name component, this regex will work. Names without a dot do not match the pattern, so they are left unchanged:

val df2 = df.columns.foldLeft(df)(
    (myDF, col) => myDF.withColumnRenamed(col, col.replaceAll(".+\\.([^.]+)$", "$1"))
)
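The regex can be checked without Spark by running it on plain strings. The sketch below uses the column names from the question plus date. The object name LastComponent, the regex-free alternative viaIndex, and the collisions helper are hypothetical additions for illustration, not part of the original answer.

```scala
object LastComponent {
  // Same regex as in the answer: keep only the text after the last dot.
  // A name without a dot does not match, so replaceAll returns it unchanged.
  def viaRegex(name: String): String =
    name.replaceAll(".+\\.([^.]+)$", "$1")

  // Regex-free alternative for ordinary names: lastIndexOf returns -1 when
  // there is no dot, so substring(0) returns the whole name.
  // Note: it differs from viaRegex for names that start or end with a dot.
  def viaIndex(name: String): String =
    name.substring(name.lastIndexOf('.') + 1)

  // Short names that more than one original column would be renamed to.
  def collisions(names: Seq[String]): Set[String] =
    names.groupBy(viaRegex).collect {
      case (short, group) if group.size > 1 => short
    }.toSet

  def main(args: Array[String]): Unit = {
    val names = Seq(
      "sales_data.metric1",
      "sales_data.type.metric2",
      "sales_data.type3.metric3",
      "date"
    )
    names.foreach(n => println(s"$n -> ${viaRegex(n)}"))
    println(s"collisions: ${collisions(names)}")
  }
}
```

Either function can be passed as the second argument to withColumnRenamed in the fold. Checking collisions first helps avoid ending up with several columns that share one name.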

EDIT 2: Get the last two components

This is a little more complicated, and there might be a cleaner way to write this, but here is a way that works:

val pattern = (
    ".*?"  +          // Lazy match of leading chars so we ignore the bits we don't want
    "([^.]+\\.)?" +   // Optional 2nd to last group
    "([^.]+)$"        // Last group
)

val df2 = df.columns.foldLeft(df)(
    (myDF, col) => myDF.withColumnRenamed(col, col.replaceAll(pattern, "$1$2"))
)
df2.printSchema
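As a possible cleaner variant, the last two components can also be taken by splitting on the dot. The sketch below runs on plain strings and compares that with the regex above. The object name LastTwoComponents and the viaSplit helper are hypothetical names used only for this illustration.

```scala
object LastTwoComponents {
  // The pattern from the answer, written on one line.
  val pattern: String = ".*?([^.]+\\.)?([^.]+)$"

  // When the optional group does not take part in the match (a name with no
  // dot), "$1" contributes an empty string to the replacement.
  def viaRegex(name: String): String = name.replaceAll(pattern, "$1$2")

  // Alternative: split on the literal dot character and keep at most the
  // last two pieces.
  def viaSplit(name: String): String =
    name.split('.').takeRight(2).mkString(".")

  def main(args: Array[String]): Unit = {
    val names = Seq(
      "sales_data.metric1",
      "sales_data.type.metric2",
      "sales_data.type3.metric3",
      "date"
    )
    names.foreach(n => println(s"$n -> ${viaRegex(n)} | ${viaSplit(n)}"))
  }
}
```

Names produced this way can still contain a dot (for example type.metric2), so they still need backticks when selected.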


Source: https://stackoverflow.com/questions/51616666/scala-how-to-substring-column-names-after-the-last-dot
