Read in CSV in Pyspark with correct Datatypes

我的未来我决定 提交于 2019-12-05 16:21:29

Here you go!

"Customer","TransDate","Quantity","PurchAmount","Cost","TransID","TransKey"
149332,"15.11.2005",1,199.95,107,127998739,100000
PATH_TO_FILE="file:///u/vikrant/LocalTestDateFile"
Loading above file to dataframe:
df = spark.read.format("com.databricks.spark.csv") \
  .option("mode", "DROPMALFORMED") \
  .option("header", "true") \
  .option("inferschema", "true") \
  .option("delimiter", ",").load(PATH_TO_FILE)

your date will get loaded as string column type, but the moment you change it to date type it will treat this date format as NULL.

df = (df.withColumn('TransDate',col('TransDate').cast('date'))

+--------+---------+--------+-----------+----+---------+--------+
|Customer|TransDate|Quantity|PurchAmount|Cost|  TransID|TransKey|
+--------+---------+--------+-----------+----+---------+--------+
|  149332|     null|       1|     199.95| 107|127998739|  100000|
+--------+---------+--------+-----------+----+---------+--------+

So we need to change the date format from dd.mm.yy to yy-mm-dd.

from datetime import datetime
from pyspark.sql.functions import col, udf
from pyspark.sql.types import DateType
from pyspark.sql.functions import col

Python function to change the date format:

  change_dateformat_func =  udf (lambda x: datetime.strptime(x, '%d.%m.%Y').strftime('%Y-%m-%d'))

call this function for your dataframe column now:

newdf = df.withColumn('TransDate', change_dateformat_func(col('TransDate')).cast(DateType()))

+--------+----------+--------+-----------+----+---------+--------+
|Customer| TransDate|Quantity|PurchAmount|Cost|  TransID|TransKey|
+--------+----------+--------+-----------+----+---------+--------+
|  149332|2005-11-15|       1|     199.95| 107|127998739|  100000|
+--------+----------+--------+-----------+----+---------+--------+

and below is the Schema:

 |-- Customer: integer (nullable = true)
 |-- TransDate: date (nullable = true)
 |-- Quantity: integer (nullable = true)
 |-- PurchAmount: double (nullable = true)
 |-- Cost: integer (nullable = true)
 |-- TransID: integer (nullable = true)
 |-- TransKey: integer (nullable = true)

Let me know if it works for you.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!