Specify multiple columns data type changes to different data types in pyspark

Submitted by 谁说我不能喝 on 2019-12-03 23:39:41

Question


I have a DataFrame (df3) with more than 50 columns of various data types, for example:

df3.printSchema()


root
 |-- CtpJobId: string (nullable = true)
 |-- TransformJobStateId: string (nullable = true)
 |-- LastError: string (nullable = true)
 |-- PriorityDate: string (nullable = true)
 |-- QueuedTime: string (nullable = true)
 |-- AccurateAsOf: string (nullable = true)
 |-- SentToDevice: string (nullable = true)
 |-- StartedAtDevice: string (nullable = true)
 |-- ProcessStart: string (nullable = true)
 |-- LastProgressAt: string (nullable = true)
 |-- ProcessEnd: string (nullable = true)
 |-- ClipFirstFrameNumber: string (nullable = true)
 |-- ClipLastFrameNumber: double (nullable = true)
 |-- SourceNamedLocation: string (nullable = true)
 |-- TargetId: string (nullable = true)
 |-- TargetNamedLocation: string (nullable = true)
 |-- TargetDirectory: string (nullable = true)
 |-- TargetFilename: string (nullable = true)
 |-- Description: string (nullable = true)
 |-- AssignedDeviceId: string (nullable = true)
 |-- DeviceResourceId: string (nullable = true)
 |-- DeviceName: string (nullable = true)
 |-- srcDropFrame: string (nullable = true)
 |-- srcDuration: double (nullable = true)
 |-- srcFrameRate: double (nullable = true)
 |-- srcHeight: double (nullable = true)
 |-- srcMediaFormat: string (nullable = true)
 |-- srcWidth: double (nullable = true)

Now I want to change all columns of a given type in one go, such as:

timestamp_type = [
    'PriorityDate', 'QueuedTime', 'AccurateAsOf', 'SentToDevice', 
    'StartedAtDevice', 'ProcessStart', 'LastProgressAt', 'ProcessEnd'
]


integer_type = [
    'ClipFirstFrameNumber', 'ClipLastFrameNumber', 'TargetId', 'srcHeight',
    'srcMediaFormat', 'srcWidth'
]

I know how to do it one column at a time, as I'm doing now:

df3 = df3.withColumn("PriorityDate", df3["PriorityDate"].cast(TimestampType()))
df3 = df3.withColumn("QueuedTime", df3["QueuedTime"].cast(TimestampType()))
df3 = df3.withColumn("AccurateAsOf", df3["AccurateAsOf"].cast(TimestampType())

df3= df3.withColumn("srcMediaFormat", df3["srcMediaFormat"].cast(IntegerType()))
df3= df3.withColumn("DeviceResourceId", df3["DeviceResourceId"].cast(IntegerType()))
df3= df3.withColumn("AssignedDeviceId", df3["AssignedDeviceId"].cast(IntegerType()))

But this looks ugly, and it's easy to miss a column I want to change. Is there a way to write a function that handles a whole list of same-type columns, so I can implement something like convert_data_type and just pass it the column names? Thanks in advance.


Answer 1:


Instead of enumerating all of your values, you should use a loop:

for c in timestamp_type:
    df3 = df3.withColumn(c, df3[c].cast(TimestampType()))

for c in integer_type:
    df3 = df3.withColumn(c, df3[c].cast(IntegerType()))
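
If you want the reusable helper the question asks for, here is a minimal sketch wrapping that loop (the name convert_data_type comes from the question; the signature is an assumption):

from pyspark.sql.types import TimestampType, IntegerType

def convert_data_type(df, column_names, new_type):
    # Cast every listed column to new_type; all other columns are untouched.
    for c in column_names:
        df = df.withColumn(c, df[c].cast(new_type))
    return df

df3 = convert_data_type(df3, timestamp_type, TimestampType())
df3 = convert_data_type(df3, integer_type, IntegerType())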

Or equivalently, you can use functools.reduce:

from functools import reduce   # not needed in python 2
df3 = reduce(
    lambda df, c: df.withColumn(c, df[c].cast(TimestampType())), 
    timestamp_type,
    df3
)

df3 = reduce(
    lambda df, c: df.withColumn(c, df[c].cast(IntegerType())),
    integer_type,
    df3
)
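
As a side note, chaining many withColumn calls adds one projection per column; all the casts can be expressed in a single select instead. A sketch, assuming every listed column exists in df3:

from pyspark.sql.functions import col
from pyspark.sql.types import TimestampType, IntegerType

# Map each column to its target type; unlisted columns keep their type.
cast_map = {c: TimestampType() for c in timestamp_type}
cast_map.update({c: IntegerType() for c in integer_type})

df3 = df3.select([
    col(c).cast(cast_map[c]).alias(c) if c in cast_map else col(c)
    for c in df3.columns
])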


Source: https://stackoverflow.com/questions/51521655/specify-multiple-columns-data-type-changes-to-different-data-types-in-pyspark
