Question
I have a dataframe with a rough structure like the following:
+-------------------------+-------------------------+--------+
| timestamp               | adj_timestamp           | values |
+-------------------------+-------------------------+--------+
| 2017-05-31 15:30:48.000 | 2017-05-31 11:30:00.000 | 0      |
| 2017-05-31 15:31:45.000 | 2017-05-31 11:30:00.000 | 0      |
| 2017-05-31 15:32:49.000 | 2017-05-31 11:30:00.000 | 0      |
+-------------------------+-------------------------+--------+
...
I am trying to apply a conversion function to the two time columns to turn them into their integer representation using the time package. My user-defined function, and how it is applied to the dataframe above:
import time

from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType

def timeConverter(timestamp):
    time_tuple = time.strptime(timestamp, "%Y-%m-%d %H:%M:%S.000")
    timevalue = time.mktime(time_tuple)
    return timevalue

def convertDateColumn(Data):
    timeUDF = udf(timeConverter, FloatType())
    finalData = Data.withColumn('adj_timestamp', timeUDF('adj_timestamp'))
    return finalData
For example, the first entry in the adj_timestamp column becomes 1496244608. Converting this back via datetime.fromtimestamp results in 2017-05-31 15:30:08, which is not the value I started with... Curious as to what is going on!
EDIT: Since I have far more rows than the 3 shown, is it possible that the data is being processed asynchronously and therefore the resulting dataframe is not in the same order as it was fed in?
Answer 1:
For the udf, I'm not quite sure yet why it's not working; it might be a float-precision problem when the Python function is wrapped as a UDF. See how using an integer output works below. Alternatively, you can solve this with the Spark function unix_timestamp, which converts timestamps for you. I give an example below. Hope it helps a bit.
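As a small sketch of the suspected cause (my addition, not part of the original answer): Spark's FloatType is a single-precision 32-bit float, and Unix timestamps around 1.5e9 sit in a range where adjacent representable float32 values are 128 seconds apart, so the cast alone can shift the time by up to 64 seconds:

import numpy as np

ts = 1496259048.0                   # 2017-05-31 15:30:48 as Unix seconds (answer's timezone)
as_float32 = float(np.float32(ts))  # what a 32-bit FloatType column effectively stores
print(as_float32)                   # 1496259072.0 -- rounded to the nearest multiple of 128
print(as_float32 - ts)              # 24.0 seconds of error from the cast alone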
Here I create a Spark dataframe from the example rows that you show:
import pandas as pd

df = pd.DataFrame([
    ['2017-05-31 15:30:48.000', '2017-05-31 11:30:00.000', 0],
    ['2017-05-31 15:31:45.000', '2017-05-31 11:30:00.000', 0],
    ['2017-05-31 15:32:49.000', '2017-05-31 11:30:00.000', 0]],
    columns=['timestamp', 'adj_timestamp', 'values'])
df = spark.createDataFrame(df)
Solve by using a Spark function
Apply fn.unix_timestamp to the column timestamp:
import pyspark.sql.functions as fn
from pyspark.sql.types import *
df.select(fn.unix_timestamp(fn.col('timestamp'), format='yyyy-MM-dd HH:mm:ss.000').alias('unix_timestamp')).show()
For the first column, the output looks like this
+--------------+
|unix_timestamp|
+--------------+
| 1496259048|
| 1496259105|
| 1496259169|
+--------------+
You can convert this back to a timestamp using the datetime library:
import datetime
datetime.datetime.fromtimestamp(1496259048) # output as datetime(2017, 5, 31, 15, 30, 48)
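For completeness, here is a sketch (my addition, not from the original answer) that keeps the reverse conversion inside Spark instead, using fn.from_unixtime; it assumes the df and fn defined above and the same session timezone:

# format the integer seconds back into a timestamp string within Spark
roundtrip = df.select(
    fn.from_unixtime(
        fn.unix_timestamp(fn.col('timestamp'), format='yyyy-MM-dd HH:mm:ss.000')
    ).alias('timestamp_roundtrip')
)
roundtrip.show(truncate=False)
# expected to print the original wall-clock times, e.g. 2017-05-31 15:30:48,
# just without the .000 fractional part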
Solve by converting to integer instead of float
import datetime
import time

def timeConverter(timestamp):
    time_tuple = datetime.datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S.000").timetuple()
    timevalue = int(time.mktime(time_tuple))  # convert to int here
    return timevalue

time_udf = fn.udf(timeConverter, IntegerType())  # output integer
df.select(time_udf(fn.col('timestamp')))
Here, we will get the same timestamps [1496259048, 1496259105, 1496259169] as with unix_timestamp.
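Finally, a sketch of how this could slot back into the question's convertDateColumn (my addition, assuming time_udf and fn from above are in scope), using IntegerType so the values survive intact:

def convertDateColumn(data):
    # convert both time columns to integer Unix seconds
    return (data
            .withColumn('timestamp', time_udf(fn.col('timestamp')))
            .withColumn('adj_timestamp', time_udf(fn.col('adj_timestamp'))))

convertDateColumn(df).show()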
Source: https://stackoverflow.com/questions/46122846/pyspark-inconsistency-in-converting-timestamp-to-integer-in-dataframe