How to read a Parquet file, change data types, and write to another Parquet file in Hadoop using PySpark

Submitted by 爱⌒轻易说出口 on 2021-02-11 14:10:27

Question


My source Parquet file stores everything as strings. My destination Parquet file needs these columns converted to other data types such as int, string, date, etc. How do I do this?


Answer 1:


You may want to apply a user-defined schema to speed up data loading. There are two ways to do this:

Using a DDL-formatted string:

spark.read.schema("a INT, b STRING, c DOUBLE").parquet("test.parquet")

Using a StructType schema (note the required imports):

from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType

customSchema = StructType([
        StructField("a", IntegerType(), True),
        StructField("b", StringType(), True),
        StructField("c", DoubleType(), True)])
spark.read.schema(customSchema).parquet("test.parquet")



Answer 2:


You should read the file, cast each column to the required type, and then save the result:

from pyspark.sql.functions import col
df = spark.read.parquet('/path/to/file')
df = df.select(col('col1').cast('int'), col('col2').cast('string'))
df.write.parquet('/target/path')



Answer 3:


Data file:

| data_extract_id   | Alien_Dollardiff | Alien_Dollar |
| ab1def1gh-123-ea0 | 0                | 0            |

Script:

def createPrqtFParqt(datPath, parquetPath, inputJsonSchema, outputdfSchema):
  print("## Parsing " + datPath)
  df = ssc.read.schema(outputdfSchema).parquet(datPath)
  print("## Writing " + parquetPath)
  df.write.mode("overwrite").parquet(parquetPath)

Output: An error occurred while calling Parquet. Column: Alien_Dollardiff| Expected double Found BINARY.

This fails because spark.read.schema() does not cast: the file stores Alien_Dollardiff as a string (BINARY in Parquet), so forcing a double schema at read time conflicts with the stored type. Read the file with its stored types first, then cast the columns.



Source: https://stackoverflow.com/questions/62650315/how-to-read-a-parquet-file-change-datatype-and-write-to-another-parquet-file-i
