Spark SQL on ORC files doesn't return correct Schema (Column names)

甜味超标 · 2020-12-21 09:55

I have a directory containing ORC files. I am creating a DataFrame using the code below:

var data = sqlContext.sql("SELECT * FROM orc.`/directory/containing/
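
For context, ORC files written by Hive store only physical column names such as _col0, _col1, …, so Spark often reports those generic names instead of the real schema. A minimal sketch to confirm what Spark infers (the path below is a hypothetical placeholder, assuming the sqlContext.read API is available):

    # Read the same directory through the DataFrameReader and print the inferred schema.
    # "/directory/containing/orc" is a placeholder standing in for the real path.
    df = sqlContext.read.format("orc").load("/directory/containing/orc")
    df.printSchema()
    # Hive-written ORC files frequently show generic names here: _col0, _col1, ...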


        
5 Answers
  •  自闭症患者 · 2020-12-21 09:57

    If you have the parquet version of the table as well, you can simply copy its column names over, which is what I did (the date column was the partition key for the ORC table, so it had to be moved to the end):

    import functools

    # Take the desired column names from the parquet copy of the table
    tx = sqlContext.table("tx_parquet")
    df = sqlContext.table("tx_orc")
    tx_cols = tx.schema.names
    tx_cols.remove('started_at_date')
    tx_cols.append('started_at_date')  # partition column goes to the end for orc

    # Rename the ORC columns positionally to the parquet names
    oldColumns = df.schema.names
    newColumns = tx_cols
    df = functools.reduce(
        lambda acc, idx: acc.withColumnRenamed(oldColumns[idx], newColumns[idx]),
        range(len(oldColumns)),
        df)
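
    A simpler variant of the same positional rename (a sketch, assuming tx_cols holds exactly one name per column of df, in the right order, and a Spark version where DataFrame.toDF(*names) is available):

    # Apply all the new names in a single call instead of chaining withColumnRenamed
    df = df.toDF(*tx_cols)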
    
