How to apply Deep Feature Synthesis to a single table

Asked by 忘掉有多难 on 2020-12-30 09:38

After processing, my data is one table with several columns that are features and one column which is a label. I would like to use featuretools.dfs to help me generate features from this single table.

1 Answer
  • Answered 2020-12-30 10:02

    It is possible to run DFS on a single table. As an example, if you have a pandas dataframe df with index 'index', you would write:

    import featuretools as ft
    es = ft.EntitySet('Transactions')
    
    es.entity_from_dataframe(dataframe=df,
                             entity_id='log',
                             index='index')
    
    fm, features = ft.dfs(entityset=es, 
                          target_entity='log',
                          trans_primitives=['day', 'weekday', 'month'])
    

    The generated feature matrix will look like

    In [1]: fm
    Out[1]: 
                 location  pies sold  WEEKDAY(date)  MONTH(date)  DAY(date)
    index                                                                  
    1         main street          3              4           12         29
    2         main street          4              5           12         30
    3         main street          5              6           12         31
    4      arlington ave.         18              0            1          1
    5      arlington ave.          1              1            1          2
    

    This applies “transform” primitives to your data. To use aggregation primitives as well, you usually want to give ft.dfs more entities. You can read about the difference in our documentation.
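    For intuition, a transform primitive maps each row of a single table to a new value, while an aggregation primitive summarizes many child rows per parent row. A rough pandas sketch of what the three transform primitives above compute, using the example data from this answer (the output column names simply mimic Featuretools' naming):

```python
import pandas as pd

# Toy log table mirroring the example data in this answer
df = pd.DataFrame({
    'location': ['main street', 'main street', 'main street',
                 'arlington ave.', 'arlington ave.'],
    'pies sold': [3, 4, 5, 18, 1],
    'date': pd.date_range('12/29/2017', periods=5, freq='D'),
})

# Transform primitives operate row by row on one table:
df['WEEKDAY(date)'] = df['date'].dt.weekday  # Monday=0 .. Sunday=6
df['MONTH(date)'] = df['date'].dt.month
df['DAY(date)'] = df['date'].dt.day
print(df[['WEEKDAY(date)', 'MONTH(date)', 'DAY(date)']])
```

    An aggregation primitive, by contrast, needs a parent entity to group by, which is what the normalization step below provides.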

    A standard workflow is to normalize your single entity by an interesting categorical. If your df was the single table

    | index | location       | pies sold | date       |
    |-------+----------------+-----------+------------|
    |     1 | main street    |         3 | 2017-12-29 |
    |     2 | main street    |         4 | 2017-12-30 |
    |     3 | main street    |         5 | 2017-12-31 |
    |     4 | arlington ave. |        18 | 2018-01-01 |
    |     5 | arlington ave. |         1 | 2018-01-02 |
    

    you would probably be interested in normalizing by location:

    es.normalize_entity(base_entity_id='log',
                        new_entity_id='locations',
                        index='location')
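
    Conceptually, normalizing by a categorical builds a new parent table with one row per unique value of that column, keeping the earliest timestamp as its time index. A rough pandas sketch of that step, assuming a df with 'location' and 'date' columns as in this answer:

```python
import pandas as pd

df = pd.DataFrame({
    'location': ['main street'] * 3 + ['arlington ave.'] * 2,
    'date': pd.date_range('12/29/2017', periods=5, freq='D'),
})

# One row per unique location, keeping the earliest log time per group
locations = (df.groupby('location', sort=False)['date']
               .min()
               .rename('first_log_time')
               .reset_index())
print(locations)
```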
    

    Your new entity locations would have the table

    | location       | first_log_time |
    |----------------+----------------|
    | main street    |     2017-12-29 |
    | arlington ave. |     2018-01-01 |
    

    which makes possible features like locations.SUM(log.pies sold) or locations.MEAN(log.pies sold), summing or averaging all values by location. You can see these features created in the example below.

    In [1]: import pandas as pd
       ...: import featuretools as ft
       ...: df = pd.DataFrame({'index': [1, 2, 3, 4, 5],
       ...:                    'location': ['main street',
       ...:                                 'main street',
       ...:                                 'main street',
       ...:                                 'arlington ave.',
       ...:                                 'arlington ave.'],
       ...:                    'pies sold': [3, 4, 5, 18, 1]})
       ...: df['date'] = pd.date_range('12/29/2017', periods=5, freq='D')
       ...: df
       ...: 
    
    Out[1]: 
       index        location  pies sold       date
    0      1     main street          3 2017-12-29
    1      2     main street          4 2017-12-30
    2      3     main street          5 2017-12-31
    3      4  arlington ave.         18 2018-01-01
    4      5  arlington ave.          1 2018-01-02
    
    In [2]: es = ft.EntitySet('Transactions')
       ...: es.entity_from_dataframe(dataframe=df, entity_id='log', index='index', time_index='date')
       ...: es.normalize_entity(base_entity_id='log', new_entity_id='locations', index='location')
       ...: 
    Out[2]: 
    Entityset: Transactions
      Entities:
        log [Rows: 5, Columns: 4]
        locations [Rows: 2, Columns: 2]
      Relationships:
        log.location -> locations.location
    
    In [3]: fm, features = ft.dfs(entityset=es,
       ...:                       target_entity='log',
       ...:                       agg_primitives=['sum', 'mean'],
       ...:                       trans_primitives=['day'])
       ...: fm
       ...: 
    Out[3]: 
                 location  pies sold  DAY(date)  locations.DAY(first_log_time)  locations.MEAN(log.pies sold)  locations.SUM(log.pies sold)
    index                                                                                                                                  
    1         main street          3         29                             29                            4.0                            12
    2         main street          4         30                             29                            4.0                            12
    3         main street          5         31                             29                            4.0                            12
    4      arlington ave.         18          1                              1                            9.5                            19
    5      arlington ave.          1          2                              1                            9.5                            19
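
    For intuition, the two aggregation columns in the feature matrix above can be reproduced with a plain pandas groupby merged back onto the log rows. A rough sketch using the same data (the column names simply mimic Featuretools' naming):

```python
import pandas as pd

df = pd.DataFrame({
    'index': [1, 2, 3, 4, 5],
    'location': ['main street'] * 3 + ['arlington ave.'] * 2,
    'pies sold': [3, 4, 5, 18, 1],
})

# Aggregate per location, then broadcast the results back to every log row
agg = df.groupby('location')['pies sold'].agg(['mean', 'sum'])
agg.columns = ['locations.MEAN(log.pies sold)', 'locations.SUM(log.pies sold)']
fm_manual = df.merge(agg, left_on='location', right_index=True).set_index('index')
print(fm_manual)
```

    DFS automates exactly this kind of group-and-broadcast computation across every entity relationship and primitive you give it.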
    