Creating binned histograms in Spark

夕颜 2021-01-07 08:37

Suppose I have a dataframe (df) (Pandas) or RDD (Spark) with the following two columns:

timestamp, data
12345.0    10 
12346.0    12

In Pa

2 answers

谎友^ (original poster)

2021-01-07 09:13

Here is an answer that uses RDDs rather than DataFrames:

# Generate some test data: one reading per minute for 150 minutes
import random
from datetime import datetime

startTS = 12345.0
array = [(startTS + 60 * k, random.randrange(10, 20)) for k in range(150)]

# Initialize an RDD (sc is assumed to be an existing SparkContext)
rdd = sc.parallelize(array)

# First map the timestamps to datetime objects so that datetime.replace
# can be used to round the times. fromtimestamp uses the local time zone.
formattedRDD = (rdd
                .map(lambda row: (datetime.fromtimestamp(int(row[0])), row[1]))
                .cache())

# Setting the minute and second fields of a datetime object to zero
# rounds it down to the hour. reduceByKey then aggregates the bins.
hourlyRDD = (formattedRDD
             .map(lambda row: (row[0].replace(minute=0, second=0), 1))
             .reduceByKey(lambda a, b: a + b))

hourlyHisto = hourlyRDD.collect()
print(hourlyHisto)
> [(datetime.datetime(1970, 1, 1, 4, 0), 60), (datetime.datetime(1970, 1, 1, 5, 0), 55), (datetime.datetime(1970, 1, 1, 3, 0), 35)]

For daily aggregates, use time.date() as the key instead of time.replace(...). To bin per hour starting from a date-time that does not fall on a round hour, shift each timestamp by the delta between that start time and the nearest round hour before rounding, then shift the resulting bin label back by the same delta.
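
The two variations can be sketched as plain key functions. This is only an illustration under some assumptions: the helper names (hour_floor, day_key, offset_hour_key) are made up for this example, timestamps are interpreted as UTC so the result does not depend on the machine's time zone, and a Counter stands in for reduceByKey so the snippet runs without Spark.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

ONE_HOUR = timedelta(hours=1)

def hour_floor(t):
    """Round a datetime down to the start of its hour."""
    return t.replace(minute=0, second=0, microsecond=0)

def day_key(t):
    """Daily bin key: the calendar date of the datetime."""
    return t.date()

def offset_hour_key(t, origin):
    """Hourly bin key for bins that start at `origin` instead of on the hour."""
    # Delta from the origin up to the next round hour.
    delta = hour_floor(origin) + ONE_HOUR - origin
    # Shift forward, round down to the hour, then shift the label back.
    return hour_floor(t + delta) - delta

# Same shape of test data as in the answer: one reading per minute for 150 minutes.
# Interpreting the timestamps as UTC is an assumption made for reproducibility.
times = [datetime.fromtimestamp(12345 + 60 * k, tz=timezone.utc) for k in range(150)]
origin = times[0]

# Counter plays the role of map(...) followed by reduceByKey(lambda a, b: a + b).
daily_histo = Counter(day_key(t) for t in times)
offset_histo = Counter(offset_hour_key(t, origin) for t in times)

for bin_start in sorted(offset_histo):
    print(bin_start, offset_histo[bin_start])
```

With an RDD of (datetime, data) pairs, the same functions could be used as the key in the map step, for example formattedRDD.map(lambda row: (day_key(row[0]), 1)).reduceByKey(lambda a, b: a + b).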
