Regression with Date variable using Scikit-learn

夕颜 2020-12-13 09:43

I have a Pandas DataFrame with a date column (e.g. 2013-04-01) of dtype datetime.date. When I include that column in X_train

4 answers
  •  一向 (original poster)
    2020-12-13 10:29

    The best way is to expand the date into a set of categorical features, each encoded as boolean columns using 1-of-K (one-hot) encoding, e.g. as done by DictVectorizer. Here are some features that can be extracted from a date:

    • hour of the day (24 boolean features)
    • day of the week (7 boolean features)
    • day of the month (up to 31 boolean features)
    • month of the year (12 boolean features)
    • year (as many boolean features as there are distinct years in your dataset) ...

    That should make it possible to identify linear dependencies on periodic events tied to typical human life cycles, such as weekly or yearly patterns.

    Additionally, you can extract the date as a single float: convert each date to the number of days since the min date of your training set, then divide by the number of days between the min date and the max date of the training set. That numerical feature should make it possible to identify long-term trends between the output and the event date: e.g. a linear slope in a regression problem, to better predict the evolution in forthcoming years, which cannot be captured by the boolean categorical variables for the year feature.
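    A minimal sketch of that scaled trend feature, using only the standard library. The function name fit_date_scaler and the sample dates are hypothetical; the min and max are taken from the training dates only, so later dates simply map to values above 1.

```python
import datetime as dt


def fit_date_scaler(train_dates):
    """Hypothetical helper: return a function mapping a date to
    (days since the training min) / (training max - training min in days)."""
    lo, hi = min(train_dates), max(train_dates)
    span = (hi - lo).days
    if span == 0:
        raise ValueError("training dates must cover more than one day")

    def scale(d):
        return (d - lo).days / span

    return scale


# Made-up training dates spanning exactly 365 days.
train_dates = [dt.date(2013, 1, 1), dt.date(2013, 7, 2), dt.date(2014, 1, 1)]
scale = fit_date_scaler(train_dates)

trend = [scale(d) for d in train_dates]  # first is 0.0, last is 1.0
future = scale(dt.date(2015, 1, 1))      # beyond the training range, > 1
```

    The resulting float can be added as one more column next to the boolean features, which lets a linear model extrapolate a slope to years it never saw during training.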
