A lot of times, I have a big dataframe df to hold the basic data, and need to create many more columns to hold the derivative data calculated by basic data columns.
You could subclass DataFrame, and add the column as a property. For example,
import pandas as pd
class LazyFrame(pd.DataFrame):
@property
def derivative_col1(self):
self['derivative_col1'] = result = self['basic_col1'] + self['basic_col2']
return result
x = LazyFrame({'basic_col1':[1,2,3],
'basic_col2':[4,5,6]})
print(x)
# basic_col1 basic_col2
# 0 1 4
# 1 2 5
# 2 3 6
Accessing the property (via x.derivative_col1, below) calls the derivative_col1 function defined in LazyFrame. This function computes the result and adds the derived column to the LazyFrame instance:
print(x.derivative_col1)
# 0 5
# 1 7
# 2 9
print(x)
# basic_col1 basic_col2 derivative_col1
# 0 1 4 5
# 1 2 5 7
# 2 3 6 9
Note that if you modify a basic column:
x['basic_col1'] *= 10
the derived column is not automatically updated:
print(x['derivative_col1'])
# 0 5
# 1 7
# 2 9
But if you access the property, the values are recomputed:
print(x.derivative_col1)
# 0 14
# 1 25
# 2 36
print(x)
# basic_col1 basic_col2 derivative_col1
# 0 10 4 14
# 1 20 5 25
# 2 30 6 36
Starting in 0.13 (releasing very soon), you can do something like this. This is using generators to evaluate a dynamic formula. In-line assignment via eval will be an additional feature in 0.13, see here
In [19]: df = DataFrame(randn(5, 2), columns=['a', 'b'])
In [20]: df
Out[20]:
a b
0 -1.949107 -0.763762
1 -0.382173 -0.970349
2 0.202116 0.094344
3 -1.225579 -0.447545
4 1.739508 -0.400829
In [21]: formulas = [ ('c','a+b'), ('d', 'a*c')]
Create a generator that evaluates a formula using eval; assigns the result, then yields the result.
In [22]: def lazy(x, formulas):
....: for col, f in formulas:
....: x[col] = x.eval(f)
....: yield x
....:
In action
In [23]: gen = lazy(df,formulas)
In [24]: gen.next()
Out[24]:
a b c
0 -1.949107 -0.763762 -2.712869
1 -0.382173 -0.970349 -1.352522
2 0.202116 0.094344 0.296459
3 -1.225579 -0.447545 -1.673123
4 1.739508 -0.400829 1.338679
In [25]: gen.next()
Out[25]:
a b c d
0 -1.949107 -0.763762 -2.712869 5.287670
1 -0.382173 -0.970349 -1.352522 0.516897
2 0.202116 0.094344 0.296459 0.059919
3 -1.225579 -0.447545 -1.673123 2.050545
4 1.739508 -0.400829 1.338679 2.328644
So its user determined ordering for the evaluation (and not on-demand). In theory numba is going to support this, so pandas possibly support this as a backend for eval (which currently uses numexpr for immediate evaluation).
my 2c.
lazy evaluation is nice, but can easily be achived by using python's own continuation/generate features, so building it into pandas, while possible, is quite tricky, and would need a really nice usecase to be generally useful.