Question
I have a dataframe with a rough structure like the following:
+-------------------------+-------------------------+--------+
| timestamp               | adj_timestamp           | values |
+-------------------------+-------------------------+--------+
| 2017-05-31 15:30:48.000 | 2017-05-31 11:30:00.000 | 0      |
| 2017-05-31 15:31:45.000 | 2017-05-31 11:30:00.000 | 0      |
| 2017-05-31 15:32:49.000 | 2017-05-31 11:30:00.000 | 0      |
+-------------------------+-------------------------+--------+
...
I am trying to apply a conversion function to the two time columns to turn them into their integer representation using the time package. My user-defined function, and how it is applied to the dataframe above:
import time

from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType

def timeConverter(timestamp):
    time_tuple = time.strptime(timestamp, "%Y-%m-%d %H:%M:%S.000")
    timevalue = time.mktime(time_tuple)
    return timevalue

def convertDateColumn(Data):
    timeUDF = udf(timeConverter, FloatType())
    finalData = Data.withColumn('adj_timestamp', timeUDF('adj_timestamp'))
    return finalData
For example, the first entry in the adj_timestamp column becomes 1496244608. Converting this back via datetime.fromtimestamp results in 2017-05-31 15:30:08, which is not the value I started with... Curious as to what is going on!
EDIT: Since I have far more rows than the 3 shown, is it possible that the data is being processed asynchronously and therefore the resulting dataframe is not in the same order as it was fed in?
Answer 1:
For the udf, I'm not quite sure yet why it's not working; it might be a float-precision problem when the Python function is wrapped as a UDF. See how using an integer output works below. Alternatively, you can solve this with the Spark function unix_timestamp, which converts timestamps for you. I give an example below. Hope it helps a bit.
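As a small sketch of the suspected cause (my addition, not part of the original answer): Spark's FloatType is a single-precision 32-bit float, and Unix timestamps around 1.5e9 sit in a range where adjacent representable float32 values are 128 seconds apart, so the cast alone can shift the time by up to 64 seconds:

import numpy as np

ts = 1496259048.0                   # 2017-05-31 15:30:48 as Unix seconds (answer's timezone)
as_float32 = float(np.float32(ts))  # what a 32-bit FloatType column effectively stores
print(as_float32)                   # 1496259072.0 -- rounded to the nearest multiple of 128
print(as_float32 - ts)              # 24.0 seconds of error from the cast alone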
Here I create a Spark dataframe from the example rows that you show:
import pandas as pd

df = pd.DataFrame([
    ['2017-05-31 15:30:48.000', '2017-05-31 11:30:00.000', 0],
    ['2017-05-31 15:31:45.000', '2017-05-31 11:30:00.000', 0],
    ['2017-05-31 15:32:49.000', '2017-05-31 11:30:00.000', 0]],
    columns=['timestamp', 'adj_timestamp', 'values'])
df = spark.createDataFrame(df)
Solve by using a Spark function
Apply fn.unix_timestamp to the column timestamp:
import pyspark.sql.functions as fn
from pyspark.sql.types import *
df.select(fn.unix_timestamp(fn.col('timestamp'), format='yyyy-MM-dd HH:mm:ss.000').alias('unix_timestamp')).show()
For the first column, the output looks like this
+--------------+
|unix_timestamp|
+--------------+
| 1496259048|
| 1496259105|
| 1496259169|
+--------------+
You can convert this back to a timestamp using the datetime library:
import datetime
datetime.datetime.fromtimestamp(1496259048) # output as datetime(2017, 5, 31, 15, 30, 48)
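For completeness, here is a sketch (my addition, not from the original answer) that keeps the reverse conversion inside Spark instead, using fn.from_unixtime; it assumes the df and fn defined above and the same session timezone:

# format the integer seconds back into a timestamp string within Spark
roundtrip = df.select(
    fn.from_unixtime(
        fn.unix_timestamp(fn.col('timestamp'), format='yyyy-MM-dd HH:mm:ss.000')
    ).alias('timestamp_roundtrip')
)
roundtrip.show(truncate=False)
# expected to print the original wall-clock times, e.g. 2017-05-31 15:30:48,
# just without the .000 fractional part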
Solve by converting to integer instead of float
import datetime
import time

def timeConverter(timestamp):
    time_tuple = datetime.datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S.000").timetuple()
    timevalue = int(time.mktime(time_tuple))  # convert to int here
    return timevalue

time_udf = fn.udf(timeConverter, IntegerType())  # output integer
df.select(time_udf(fn.col('timestamp')))
Here, we will get the same timestamps [1496259048, 1496259105, 1496259169] as with unix_timestamp.
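Finally, a sketch of how this could slot back into the question's convertDateColumn (my addition, assuming time_udf and fn from above are in scope), using IntegerType so the values survive intact:

def convertDateColumn(data):
    # convert both time columns to integer Unix seconds
    return (data
            .withColumn('timestamp', time_udf(fn.col('timestamp')))
            .withColumn('adj_timestamp', time_udf(fn.col('adj_timestamp'))))

convertDateColumn(df).show()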
Source: https://stackoverflow.com/questions/46122846/pyspark-inconsistency-in-converting-timestamp-to-integer-in-dataframe