How to convert int64 datatype columns of parquet file to timestamp in SparkSQL data frame?


Question


Here is what my DataFrame looks like:

+----------------+-------------+
|   Business_Date|         Code|
+----------------+-------------+
|1539129600000000|          BSD|
|1539129600000000|          BTN|
|1539129600000000|          BVI|
|1539129600000000|          BWP|
|1539129600000000|          BYB|
+----------------+-------------+

I want to convert the Business_Date column from bigint to a timestamp value while loading the data into a Hive table.

How can I do this?


Answer 1:


You can use pyspark.sql.functions.from_unixtime(), which, per its documentation:

Converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the given format.

It appears that your Business_Date values are in microseconds since the epoch, so they need to be divided by 1,000,000 to convert them to seconds.

For example:

from pyspark.sql.functions import from_unixtime, col

# divide microseconds by 1,000,000 to get seconds, convert with
# from_unixtime, then cast the resulting string back to timestamp
df = df.withColumn(
    "Business_Date",
    from_unixtime(col("Business_Date") / 1000000).cast("timestamp")
)
df.show()
#+---------------------+----+
#|Business_Date        |Code|
#+---------------------+----+
#|2018-10-09 20:00:00.0|BSD |
#|2018-10-09 20:00:00.0|BTN |
#|2018-10-09 20:00:00.0|BVI |
#|2018-10-09 20:00:00.0|BWP |
#|2018-10-09 20:00:00.0|BYB |
#+---------------------+----+

from_unixtime returns a string, so you can cast the result to a timestamp.
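Alternatively, Spark interprets a numeric value as seconds since the unix epoch when it is cast to timestamp, so you can skip from_unixtime entirely. A minimal sketch, assuming the same microsecond-valued Business_Date column:

from pyspark.sql.functions import col

# scale microseconds down to seconds and cast the double directly;
# the cast treats the number as seconds since the unix epoch
df = df.withColumn(
    "Business_Date",
    (col("Business_Date") / 1000000).cast("timestamp")
)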

Now the new schema:

df.printSchema()
#root
# |-- Business_Date: timestamp (nullable = true)
# |-- Code: string (nullable = true)
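
To cover the "loading into a Hive table" part of the question: a minimal sketch of writing the converted DataFrame out, assuming Hive support is enabled on the SparkSession and using a hypothetical table name my_db.my_table:

# persist the converted DataFrame as a Hive table
# (my_db.my_table is a placeholder name)
df.write.mode("append").saveAsTable("my_db.my_table")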


Source: https://stackoverflow.com/questions/54353974/how-to-convert-int64-datatype-columns-of-parquet-file-to-timestamp-in-sparksql-d
