Question
My source parquet file has everything as string. My destination parquet file needs these converted to different data types such as int, string, date, etc. How do I do this?
Answer 1:
You may want to apply a user-defined schema to speed up data loading. There are two ways to do that:
Using a DDL-formatted string
spark.read.schema("a INT, b STRING, c DOUBLE").parquet("test.parquet")
Using a StructType schema
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType

customSchema = StructType([
    StructField("a", IntegerType(), True),
    StructField("b", StringType(), True),
    StructField("c", DoubleType(), True)])
spark.read.schema(customSchema).parquet("test.parquet")
Answer 2:
You should read the file, cast each column to the required type, and then save the result:
from pyspark.sql.functions import col

df = spark.read.parquet('/path/to/file')
# Cast each column to the type you need.
df = df.select(col('col1').cast('int'), col('col2').cast('string'))
df.write.parquet('/target/path')
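Conceptually, the per-column cast above just applies a converter to every value in a column. A minimal pure-Python sketch of the same idea (the column names, values, and target types here are hypothetical, chosen to mirror the int/string/date types from the question):

```python
from datetime import datetime

# Hypothetical rows as they would appear in the all-string source file.
rows = [
    {"col1": "42", "col2": "abc", "col3": "2020-06-29"},
    {"col1": "7",  "col2": "xyz", "col3": "2020-07-01"},
]

# Map each column to a converter, mirroring cast('int'), cast('string'),
# and a date parse in Spark.
converters = {
    "col1": int,
    "col2": str,
    "col3": lambda s: datetime.strptime(s, "%Y-%m-%d").date(),
}

typed_rows = [
    {name: conv(row[name]) for name, conv in converters.items()}
    for row in rows
]

print(typed_rows[0])
# {'col1': 42, 'col2': 'abc', 'col3': datetime.date(2020, 6, 29)}
```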
Answer 3:
Data file:
| data_extract_id   | Alien_Dollardiff | Alien_Dollar |
| ab1def1gh-123-ea0 | 0                | 0            |
Script:
def createPrqtFParqt(datPath, parquetPath, inpustJsonSchema, outputdfSchema):
    print("## Parsing " + datPath)
    df = ssc.read.schema(outputdfSchema).parquet(datPath)
    print("## Writing " + parquetPath)
    df.write.mode("overwrite").parquet(parquetPath)
Output: An error occurred while calling Parquet. Column: Alien_Dollardiff | Expected double, found BINARY.
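The error arises because the file physically stores Alien_Dollardiff as BINARY (string) data, and a schema passed to spark.read only declares the expected types; it does not convert the stored values, so the columns must be cast after reading, as in Answer 2. As a rough stdlib analogy (not Spark code): declaring a type is like reinterpreting raw bytes, while casting is like converting the value:

```python
import struct

raw = b"0"  # the value of Alien_Dollardiff as stored: a one-byte string

# Declaring "this column is a double" cannot turn string bytes into a
# double: a double needs an 8-byte IEEE-754 encoding, which b"0" is not.
try:
    struct.unpack("<d", raw)
except struct.error as exc:
    print("cannot reinterpret BINARY as double:", exc)

# Converting the value, which is what cast('double') does, works fine:
print(float(raw))  # 0.0
```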
Source: https://stackoverflow.com/questions/62650315/how-to-read-a-parquet-file-change-datatype-and-write-to-another-parquet-file-i