Question
I am quite new to PySpark and this problem is boggling me. Basically, I am looking for a scalable way to loop through a StructType or ArrayType and typecast its fields.
Example of my data schema:
root
|-- _id: string (nullable = true)
|-- created: timestamp (nullable = true)
|-- card_rates: struct (nullable = true)
| |-- rate_1: integer (nullable = true)
| |-- rate_2: integer (nullable = true)
| |-- rate_3: integer (nullable = true)
| |-- card_fee: integer (nullable = true)
| |-- payment_method: string (nullable = true)
|-- online_rates: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- rate_1: integer (nullable = true)
| | |-- rate_2: integer (nullable = true)
| | |-- online_fee: double (nullable = true)
|-- updated: timestamp (nullable = true)
As you can see here, card_rates is a struct and online_rates is an array of structs. I am looking for a way to loop through all the fields above and conditionally typecast them. Ideally, if a field is supposed to be numeric it should be converted to double, and if it is supposed to be string it should be converted to string. I need to loop because those rate_* fields may grow over time.
But for now, I would be content with being able to loop through them and typecast all of them to string, since I am very new to PySpark and still trying to get a feel for it.
My desired output schema:
root
|-- _id: string (nullable = true)
|-- created: timestamp (nullable = true)
|-- card_rates: struct (nullable = true)
| |-- rate_1: double (nullable = true)
| |-- rate_2: double (nullable = true)
| |-- rate_3: double (nullable = true)
| |-- card_fee: double (nullable = true)
| |-- payment_method: string (nullable = true)
|-- online_rates: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- rate_1: double (nullable = true)
| | |-- rate_2: double (nullable = true)
| | |-- online_fee: double (nullable = true)
|-- updated: timestamp (nullable = true)
I am running out of ideas on how to do this.
I found a reference here: PySpark convert struct field inside array to string, but that solution hardcodes the field and does not really loop over the fields.
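For illustration, the hardcoded approach I am referring to would look roughly like this (a sketch only, assuming a DataFrame df with the schema above; df_hardcoded is just a name chosen here, and every nested field has to be spelled out by hand in the target type string):
from pyspark.sql import functions as F

# hardcoded casts: the full target type of each complex column is written out manually
df_hardcoded = (
    df
    .withColumn("card_rates", F.col("card_rates").cast(
        "struct<rate_1:double,rate_2:double,rate_3:double,card_fee:double,payment_method:string>"))
    .withColumn("online_rates", F.col("online_rates").cast(
        "array<struct<rate_1:double,rate_2:double,online_fee:double>>"))
)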
Kindly help.
Answer 1:
Here is one solution with the help of StructType.simpleString and the built-in _parse_datatype_string function:
from pyspark.sql.types import *
from pyspark.sql.types import _parse_datatype_string  # not exported by *, needs an explicit import

df_schema = StructType([
    StructField("_id", StringType(), True),
    StructField("created", TimestampType(), True),
    StructField("card_rates", StructType([
        StructField("rate_1", IntegerType(), True),
        StructField("rate_2", IntegerType(), True),
        StructField("rate_3", IntegerType(), True),
        StructField("card_fee", IntegerType(), True),
        StructField("payment_method", StringType(), True)])),
    StructField("online_rates", ArrayType(
        StructType([
            StructField("rate_1", IntegerType(), True),
            StructField("rate_2", IntegerType(), True),
            StructField("online_fee", DoubleType(), True)
        ]), True), True),
    StructField("updated", TimestampType(), True)])
schema_str = df_schema.simpleString() # this gives -> struct<_id:string,created:timestamp,card_rates:struct<rate_1:int,rate_2:int,rate_3:int,card_fee:int,payment_method:string>,online_rates:array<struct<rate_1:int,rate_2:int,online_fee:double>>,updated:timestamp>
double_schema = schema_str.replace(':int', ':double')
# convert back to StructType
final_schema = _parse_datatype_string(double_schema)
final_schema
- First convert your schema into a simple string with schema.simpleString
- Then replace all :int with :double
- Finally convert the modified string schema into StructType with _parse_datatype_string
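To apply the rebuilt schema to the existing DataFrame, one option is to cast each column to its new type; a minimal sketch, assuming df is the original DataFrame (cast_expr is just a name chosen here, and the same idea is shown in the update at the end of this answer):
# cast every top-level column to the type given by the rebuilt schema
cast_expr = [df[f.name].cast(f.dataType) for f in final_schema.fields]
df.select(*cast_expr).printSchema()  # rate_* and card_fee should now be double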
UPDATE:
In order to avoid the issue with the backticks that @jxc pointed out, a better solution is a recursive scan through the schema elements, as shown next:
def transform_schema(schema):
    if schema is None:
        return StructType()
    updated = []
    for f in schema.fields:
        if isinstance(f.dataType, IntegerType):
            # if IntegerType, convert to DoubleType
            updated.append(StructField(f.name, DoubleType(), f.nullable))
        elif isinstance(f.dataType, ArrayType):
            # if ArrayType, unpack the element type (here a struct), do recursion,
            # then wrap the result with ArrayType again
            updated.append(StructField(
                f.name,
                ArrayType(transform_schema(f.dataType.elementType),
                          f.dataType.containsNull),
                f.nullable))
        elif isinstance(f.dataType, StructType):
            # if StructType, do recursion
            updated.append(StructField(f.name, transform_schema(f.dataType), f.nullable))
        else:
            # else handle all the other cases, i.e. TimestampType, StringType etc.
            updated.append(StructField(f.name, f.dataType, f.nullable))
    return StructType(updated)
# call the function with your schema
transform_schema(df_schema)
Explanation: the function goes through each item of the schema (StructType) and converts the int fields (StructField) into double. Finally, it returns the converted schema (StructType) to the layer above (the parent StructType).
Output:
StructType(List(
StructField(_id,StringType,true),
StructField(created,TimestampType,true),
StructField(card_rates,
StructType(List(StructField(rate_1,DoubleType,true),
StructField(rate_2,DoubleType,true),
StructField(rate_3,DoubleType,true),
StructField(card_fee,DoubleType,true),
StructField(payment_method,StringType,true))),true),
StructField(online_rates,ArrayType(
StructType(List(
StructField(rate_1,DoubleType,true),
StructField(rate_2,DoubleType,true),
StructField(online_fee,DoubleType,true))),true),true),
StructField(updated,TimestampType,true)))
UPDATE: (2020-02-02)
And here is one example of how to use the new schema together with the existing DataFrame:
updated_schema = transform_schema(df.schema)
# cast each column to the new type
select_expr = [df[f.name].cast(f.dataType) for f in updated_schema.fields]
df.select(*select_expr).show()
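As a quick check (a minimal sketch; df_converted is just a name chosen here), printing the schema of the converted DataFrame should show the former int fields as double:
df_converted = df.select(*select_expr)
df_converted.printSchema()  # rate_* and card_fee should now appear as double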
Source: https://stackoverflow.com/questions/58697415/pyspark-looping-through-structtype-and-arraytype-to-do-typecasting-in-the-stru