How to apply Deep Feature Synthesis to a single table

Asked by 忘掉有多难 on 2020-12-30 09:38

After processing, my data is one table with several columns that are features and one column which is a label. I would like to use featuretools.dfs to help me generate features from this single table.

1 Answer
  • Answered 2020-12-30 10:02

    It is possible to run DFS on a single table. As an example, if you have a pandas dataframe df with index 'index', you would write:

    import featuretools as ft
    es = ft.EntitySet('Transactions')
    
    es.entity_from_dataframe(dataframe=df,
                             entity_id='log',
                             index='index')
    
    fm, features = ft.dfs(entityset=es, 
                          target_entity='log',
                          trans_primitives=['day', 'weekday', 'month'])
    

    The generated feature matrix will look like

    In [1]: fm
    Out[1]: 
                 location  pies sold  WEEKDAY(date)  MONTH(date)  DAY(date)
    index                                                                  
    1         main street          3              4           12         29
    2         main street          4              5           12         30
    3         main street          5              6           12         31
    4      arlington ave.         18              0            1          1
    5      arlington ave.          1              1            1          2
    

    This applies “transform” primitives to your data. To use aggregation primitives as well, you usually want to give ft.dfs more entities. You can read about the difference in our documentation.
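    For intuition, a transform primitive maps each row of a single table to a new value, while an aggregation primitive summarizes many child rows per parent row. A rough pandas sketch of what the three transform primitives above compute, using the example data from this answer (the output column names simply mimic Featuretools' naming):

```python
import pandas as pd

# Toy log table mirroring the example data in this answer
df = pd.DataFrame({
    'location': ['main street', 'main street', 'main street',
                 'arlington ave.', 'arlington ave.'],
    'pies sold': [3, 4, 5, 18, 1],
    'date': pd.date_range('12/29/2017', periods=5, freq='D'),
})

# Transform primitives operate row by row on one table:
df['WEEKDAY(date)'] = df['date'].dt.weekday  # Monday=0 .. Sunday=6
df['MONTH(date)'] = df['date'].dt.month
df['DAY(date)'] = df['date'].dt.day
print(df[['WEEKDAY(date)', 'MONTH(date)', 'DAY(date)']])
```

    An aggregation primitive, by contrast, needs a parent entity to group by, which is what the normalization step below provides.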

    A standard workflow is to normalize your single entity by an interesting categorical. If your df was the single table

    | index | location       | pies sold | date       |
    |-------+----------------+-----------+------------|
    |     1 | main street    |         3 | 2017-12-29 |
    |     2 | main street    |         4 | 2017-12-30 |
    |     3 | main street    |         5 | 2017-12-31 |
    |     4 | arlington ave. |        18 | 2018-01-01 |
    |     5 | arlington ave. |         1 | 2018-01-02 |
    

    you would probably be interested in normalizing by location:

    es.normalize_entity(base_entity_id='log',
                        new_entity_id='locations',
                        index='location')
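
    Conceptually, normalizing by a categorical builds a new parent table with one row per unique value of that column, keeping the earliest timestamp as its time index. A rough pandas sketch of that step, assuming a df with 'location' and 'date' columns as in this answer:

```python
import pandas as pd

df = pd.DataFrame({
    'location': ['main street'] * 3 + ['arlington ave.'] * 2,
    'date': pd.date_range('12/29/2017', periods=5, freq='D'),
})

# One row per unique location, keeping the earliest log time per group
locations = (df.groupby('location', sort=False)['date']
               .min()
               .rename('first_log_time')
               .reset_index())
print(locations)
```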
    

    Your new entity locations would have the table

    | location       | first_log_time |
    |----------------+----------------|
    | main street    |     2017-12-29 |
    | arlington ave. |     2018-01-01 |
    

    which makes possible features like locations.SUM(log.pies sold) or locations.MEAN(log.pies sold), summing or averaging all values by location. You can see these features created in the example below.

    In [1]: import pandas as pd
       ...: import featuretools as ft
       ...: df = pd.DataFrame({'index': [1, 2, 3, 4, 5],
       ...:                    'location': ['main street',
       ...:                                 'main street',
       ...:                                 'main street',
       ...:                                 'arlington ave.',
       ...:                                 'arlington ave.'],
       ...:                    'pies sold': [3, 4, 5, 18, 1]})
       ...: df['date'] = pd.date_range('12/29/2017', periods=5, freq='D')
       ...: df
       ...: 
    
    Out[1]: 
       index        location  pies sold       date
    0      1     main street          3 2017-12-29
    1      2     main street          4 2017-12-30
    2      3     main street          5 2017-12-31
    3      4  arlington ave.         18 2018-01-01
    4      5  arlington ave.          1 2018-01-02
    
    In [2]: es = ft.EntitySet('Transactions')
       ...: es.entity_from_dataframe(dataframe=df, entity_id='log', index='index', time_index='date')
       ...: es.normalize_entity(base_entity_id='log', new_entity_id='locations', index='location')
       ...: 
    Out[2]: 
    Entityset: Transactions
      Entities:
        log [Rows: 5, Columns: 4]
        locations [Rows: 2, Columns: 2]
      Relationships:
        log.location -> locations.location
    
    In [3]: fm, features = ft.dfs(entityset=es,
       ...:                       target_entity='log',
       ...:                       agg_primitives=['sum', 'mean'],
       ...:                       trans_primitives=['day'])
       ...: fm
       ...: 
    Out[3]: 
                 location  pies sold  DAY(date)  locations.DAY(first_log_time)  locations.MEAN(log.pies sold)  locations.SUM(log.pies sold)
    index                                                                                                                                  
    1         main street          3         29                             29                            4.0                            12
    2         main street          4         30                             29                            4.0                            12
    3         main street          5         31                             29                            4.0                            12
    4      arlington ave.         18          1                              1                            9.5                            19
    5      arlington ave.          1          2                              1                            9.5                            19
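
    For intuition, the two aggregation columns in the feature matrix above can be reproduced with a plain pandas groupby merged back onto the log rows. A rough sketch using the same data (the column names simply mimic Featuretools' naming):

```python
import pandas as pd

df = pd.DataFrame({
    'index': [1, 2, 3, 4, 5],
    'location': ['main street'] * 3 + ['arlington ave.'] * 2,
    'pies sold': [3, 4, 5, 18, 1],
})

# Aggregate per location, then broadcast the results back to every log row
agg = df.groupby('location')['pies sold'].agg(['mean', 'sum'])
agg.columns = ['locations.MEAN(log.pies sold)', 'locations.SUM(log.pies sold)']
fm_manual = df.merge(agg, left_on='location', right_index=True).set_index('index')
print(fm_manual)
```

    DFS automates exactly this kind of group-and-broadcast computation across every entity relationship and primitive you give it.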
    