How to read a Parquet file, change data types, and write to another Parquet file in Hadoop using PySpark

Submitted by 爱⌒轻易说出口 on 2021-02-11 14:10:27

Question


My source Parquet file stores everything as strings. My destination Parquet file needs these columns converted to other data types such as int, string, date, etc. How do I do this?


Answer 1:


You may want to apply a user-defined schema to speed up data loading. There are two ways to do this:

Using a DDL-formatted string:

spark.read.schema("a INT, b STRING, c DOUBLE").parquet("test.parquet")

Using a StructType schema (note the required imports):

from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType

customSchema = StructType([
        StructField("a", IntegerType(), True),
        StructField("b", StringType(), True),
        StructField("c", DoubleType(), True)])
spark.read.schema(customSchema).parquet("test.parquet")



Answer 2:


You should read the file, cast each column to the required type, and then save the result:

from pyspark.sql.functions import col
df = spark.read.parquet('/path/to/file')
df = df.select(col('col1').cast('int'), col('col2').cast('string'))
df.write.parquet('/target/path')



Answer 3:


Data file:

| data_extract_id   | Alien_Dollardiff | Alien_Dollar |
| ab1def1gh-123-ea0 | 0                | 0            |

Script:

def createPrqtFParqt(datPath, parquetPath, inputJsonSchema, outputdfSchema):
  print("## Parsing " + datPath)
  df = ssc.read.schema(outputdfSchema).parquet(datPath)
  print("## Writing " + parquetPath)
  df.write.mode("overwrite").parquet(parquetPath)

Output: An error occurred while calling Parquet. Column: Alien_Dollardiff| Expected double Found BINARY.

This fails because spark.read.schema() does not cast: the file stores Alien_Dollardiff as a string (BINARY in Parquet), so forcing a double schema at read time conflicts with the stored type. Read the file with its stored types first, then cast the columns.



Source: https://stackoverflow.com/questions/62650315/how-to-read-a-parquet-file-change-datatype-and-write-to-another-parquet-file-i
