Python: Numpy and Pandas Transforming timestamp/data into one-hot-encoding

Posted by China☆狼群 on 2021-02-19 05:20:52

Question


I have a column of a dataframe that is like this

              time
0       2017-03-01 15:30:00
1       2017-03-01 16:00:00
2       2017-03-01 16:30:00
3       2017-03-01 17:00:00
4       2017-03-01 17:30:00
5       2017-03-01 18:00:00
6       2017-03-01 18:30:00
7       2017-03-01 19:00:00
8       2017-03-01 19:30:00
9       2017-03-01 20:00:00
10      2017-03-01 20:30:00
11      2017-03-01 21:00:00
12      2017-03-01 21:30:00
13      2017-03-01 22:00:00
.
.
.

I want to "encode" the time of day. I want to do this by first assigning each half-hour an integer, starting from

 00:30:00 --> 1
 01:00:00 --> 2
 01:30:00 --> 3
 02:00:00 --> 4
 02:30:00 --> 5

and so on. Therefore we would have 48 numbers (since there are 24 hours). I would like to find the fastest way of transforming my column into a list/column containing those values.

So far I can do this for one value. For instance

2*int(timeDF.iloc[0][11:13]) + int(int(timeDF.iloc[0][14:16])/30) would transform 15:30:00 into 31.

I think I could do this with a loop that, instead of using 0, runs an index over the length of the column. However, is there a faster way?

one hot encoding

After finding those values, I would use a one-hot encoder (I think sklearn has one). But the most difficult part is the conversion above.

stupid solution

labels = []
for date in time:
    labels.append(2*int(date[11:13]) + int(int(date[14:16])/30))

This list would contain the values, and then one could one-hot encode them as in the example linked in the original post.


Answer 1:


I think you need map with get_dummies.

Also, it seems the first time, 0:00, should map to 0 and 0:30 to 1, so I use range(48).

import pandas as pd

#convert to datetimes if necessary
df['time'] = pd.to_datetime(df['time'])

#create dictionary mapping each half-hour time of day to 0..47
#('30min' replaces the alias '30T', which newer pandas versions deprecate)
a = dict(zip(pd.date_range('2010-01-01', periods=48, freq='30min').time, range(48)))

#convert time column to times and map by dict
df['a'] = df['time'].dt.time.map(a)
print (df)
                  time   a
0  2017-03-01 15:30:00  31
1  2017-03-01 16:00:00  32
2  2017-03-01 16:30:00  33
3  2017-03-01 17:00:00  34
4  2017-03-01 17:30:00  35
5  2017-03-01 18:00:00  36
6  2017-03-01 18:30:00  37
7  2017-03-01 19:00:00  38
8  2017-03-01 19:30:00  39
9  2017-03-01 20:00:00  40
10 2017-03-01 20:30:00  41
11 2017-03-01 21:00:00  42
12 2017-03-01 21:30:00  43
13 2017-03-01 22:00:00  44
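
As a dictionary-free alternative, the asker's own arithmetic (two slots per hour, plus one for the second half-hour) can probably be applied to the whole column at once through the .dt accessor. This is only a sketch: the sample rows and the column name slot are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data mirroring the question's column
df = pd.DataFrame({"time": ["2017-03-01 15:30:00",
                            "2017-03-01 16:00:00",
                            "2017-03-01 00:00:00"]})
df["time"] = pd.to_datetime(df["time"])

# Two slots per hour, plus one more if the minute is 30 or later
df["slot"] = df["time"].dt.hour * 2 + df["time"].dt.minute // 30

print(df["slot"].tolist())  # [31, 32, 0]
```

Because this works on whole columns, it avoids both the Python loop and the string slicing from the question.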

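#for one hot encoding use get_dummies
#(dtype=int keeps the 0/1 output shown below; newer pandas versions default to booleans)
df1 = pd.get_dummies(df['time'].dt.time.map(a), dtype=int)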
print (df1)
    31  32  33  34  35  36  37  38  39  40  41  42  43  44
0    1   0   0   0   0   0   0   0   0   0   0   0   0   0
1    0   1   0   0   0   0   0   0   0   0   0   0   0   0
2    0   0   1   0   0   0   0   0   0   0   0   0   0   0
3    0   0   0   1   0   0   0   0   0   0   0   0   0   0
4    0   0   0   0   1   0   0   0   0   0   0   0   0   0
5    0   0   0   0   0   1   0   0   0   0   0   0   0   0
6    0   0   0   0   0   0   1   0   0   0   0   0   0   0
7    0   0   0   0   0   0   0   1   0   0   0   0   0   0
8    0   0   0   0   0   0   0   0   1   0   0   0   0   0
9    0   0   0   0   0   0   0   0   0   1   0   0   0   0
10   0   0   0   0   0   0   0   0   0   0   1   0   0   0
11   0   0   0   0   0   0   0   0   0   0   0   1   0   0
12   0   0   0   0   0   0   0   0   0   0   0   0   1   0
13   0   0   0   0   0   0   0   0   0   0   0   0   0   1

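EDIT:

To get all 48 columns even when some half-hours do not occur in the data, reindex the columns:

df1 = pd.get_dummies(df['time'].dt.time.map(a), dtype=int).reindex(columns=range(48), fill_value=0)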
    0   1   2   3   4   5   6   7   8   9  ...  38  39  40  41  42  43  44  \
0    0   0   0   0   0   0   0   0   0   0 ...   0   0   0   0   0   0   0   
1    0   0   0   0   0   0   0   0   0   0 ...   0   0   0   0   0   0   0   
2    0   0   0   0   0   0   0   0   0   0 ...   0   0   0   0   0   0   0   
3    0   0   0   0   0   0   0   0   0   0 ...   0   0   0   0   0   0   0   
4    0   0   0   0   0   0   0   0   0   0 ...   0   0   0   0   0   0   0   
5    0   0   0   0   0   0   0   0   0   0 ...   0   0   0   0   0   0   0   
6    0   0   0   0   0   0   0   0   0   0 ...   0   0   0   0   0   0   0   
7    0   0   0   0   0   0   0   0   0   0 ...   1   0   0   0   0   0   0   
8    0   0   0   0   0   0   0   0   0   0 ...   0   1   0   0   0   0   0   
9    0   0   0   0   0   0   0   0   0   0 ...   0   0   1   0   0   0   0   
10   0   0   0   0   0   0   0   0   0   0 ...   0   0   0   1   0   0   0   
11   0   0   0   0   0   0   0   0   0   0 ...   0   0   0   0   1   0   0   
12   0   0   0   0   0   0   0   0   0   0 ...   0   0   0   0   0   1   0   
13   0   0   0   0   0   0   0   0   0   0 ...   0   0   0   0   0   0   1   

    45  46  47  
0    0   0   0  
1    0   0   0  
2    0   0   0  
3    0   0   0  
4    0   0   0  
5    0   0   0  
6    0   0   0  
7    0   0   0  
8    0   0   0  
9    0   0   0  
10   0   0   0  
11   0   0   0  
12   0   0   0  
13   0   0   0  

[14 rows x 48 columns]
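
Since the question also mentions NumPy, here is a minimal sketch of the same fixed-width encoding done by indexing an identity matrix. The series slots is a hypothetical stand-in for the a column built above; it assumes every value is an integer in 0..47 with no missing entries.

```python
import numpy as np
import pandas as pd

# Hypothetical slot numbers (0..47), standing in for the 'a' column
slots = pd.Series([31, 32, 44])

# Row i of the 48x48 identity matrix is the one-hot vector for slot i
one_hot = np.eye(48, dtype=int)[slots.to_numpy()]

# Wrap the array so the columns are labelled 0..47
df1 = pd.DataFrame(one_hot, index=slots.index, columns=range(48))
print(df1.shape)  # (3, 48)
```

This always yields 48 columns, so no reindex step is needed, but it would raise an IndexError on values outside 0..47.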



Answer 2:


I think this is what you are looking for:

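import numpy as np
import pandas as pd
# 'HH:MM:SS' labels for all 48 half-hour slots; starting at 00:00 so midnight maps to 0
x = pd.date_range("00:00", "23:30", freq="30min").strftime("%H:%M:%S")
maps = dict(zip(x, np.arange(48)))
# format the column the same way, then look up each slot number
df['new'] = pd.to_datetime(df['time']).dt.strftime("%H:%M:%S").map(maps)
pd.get_dummies(df['new'], dtype=int).set_index(df['time'])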

Output:

                     31  32  33  34  35  36  37  38  39  40  41  42  43  44
time                                                                       
2017-03-01 15:30:00   1   0   0   0   0   0   0   0   0   0   0   0   0   0
2017-03-01 16:00:00   0   1   0   0   0   0   0   0   0   0   0   0   0   0
2017-03-01 16:30:00   0   0   1   0   0   0   0   0   0   0   0   0   0   0
2017-03-01 17:00:00   0   0   0   1   0   0   0   0   0   0   0   0   0   0
2017-03-01 17:30:00   0   0   0   0   1   0   0   0   0   0   0   0   0   0
2017-03-01 18:00:00   0   0   0   0   0   1   0   0   0   0   0   0   0   0
2017-03-01 18:30:00   0   0   0   0   0   0   1   0   0   0   0   0   0   0
2017-03-01 19:00:00   0   0   0   0   0   0   0   1   0   0   0   0   0   0
2017-03-01 19:30:00   0   0   0   0   0   0   0   0   1   0   0   0   0   0
2017-03-01 20:00:00   0   0   0   0   0   0   0   0   0   1   0   0   0   0
2017-03-01 20:30:00   0   0   0   0   0   0   0   0   0   0   1   0   0   0
2017-03-01 21:00:00   0   0   0   0   0   0   0   0   0   0   0   1   0   0
2017-03-01 21:30:00   0   0   0   0   0   0   0   0   0   0   0   0   1   0
2017-03-01 22:00:00   0   0   0   0   0   0   0   0   0   0   0   0   0   1
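
One caveat with a string lookup like this: a timestamp that is not exactly on a half-hour boundary has no key in the dictionary and maps to NaN. A possible workaround, sketched here with made-up timestamps and hypothetical variable names, is to floor each value to 30 minutes before the lookup.

```python
import pandas as pd

# Hypothetical timestamps, one of them off the half-hour grid
times = pd.Series(pd.to_datetime(["2017-03-01 15:30:00",
                                  "2017-03-01 15:45:10",
                                  "2017-03-01 16:00:00"]))

# Lookup table keyed by 'HH:MM:SS' strings for the 48 half-hour slots
keys = pd.date_range("2010-01-01", periods=48, freq="30min").strftime("%H:%M:%S")
maps = dict(zip(keys, range(48)))

# Round each timestamp down to its half-hour slot before the lookup
floored = times.dt.floor("30min").dt.strftime("%H:%M:%S")
print(floored.map(maps).tolist())  # [31, 31, 32]
```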


Source: https://stackoverflow.com/questions/46607306/python-numpy-and-pandas-transforming-timestamp-data-into-one-hot-encoding
