Efficiently calculating point of control with pandas

馋奶兔 提交于 2021-02-08 10:35:13

问题


My algorithm stepped up from 35 seconds to 15 minutes runtime when implementing this feature over a daily timeframe. The algo retrieves daily history in bulk and iterates over a subset of the dataframe (from t0 to tX where tX is the current row of iteration). It does this to emulate what would happen during the real time operations of the algo. I know there are ways of improving it by utilizing memory between frame calculations but I was wondering if there was a more pandas-ish implementation that would see immediate benefit.

Assume that self.Step is something like 0.00001 and self.Precision is 5; they are used for binning the ohlc bar information into discrete steps for the sake of finding the poc. _frame is a subset of rows of the entire dataframe, and _low/_high are respective to that. The following block of code executes on the entire _frame which could be upwards of ~250 rows every time there is a new row added by the algo (when calculating yearly timeframe on daily data). I believe it's the iterrows that's causing the major slowdown. The dataframe has columns such as high, low, open, close, volume. I am calculating time price opportunity and volume point of control.

# Set the complete index of prices +/- 1 step due to weird floating point precision issues
volume_prices = pd.Series(0, index=np.around(np.arange(_low - self.Step, _high + self.Step, self.Step), decimals=self.Precision))
time_prices = volume_prices.copy()
for index, state in _frame.iterrows():
    _prices = np.around(np.arange(state.low, state.high, self.Step), decimals=self.Precision)
    # Evenly distribute the bar's volume over its range
    volume_prices[_prices] += state.volume / _prices.size
    # Increment time at price
    time_prices[_prices] += 1
# Pandas only returns the 1st row of the max value,
# so we need to reverse the series to find the other side
# and then find the average price between those two extremes
volume_poc = (volume_prices.idxmax() + volume_prices.iloc[::-1].idxmax()) / 2)
time_poc = (time_prices.idxmax() + time_prices.iloc[::-1].idxmax()) / 2)

回答1:


you can use this function as a base and to adjust it:

def f(x):                             #function to find the POC price and volume
    a = x['tradePrice'].value_counts().index[0]
    b = x.loc[x['tradePrice'] == a, 'tradeVolume'].sum()
    return pd.Series([a,b],['POC_Price','POC_Volume'])



回答2:


Here's what I worked out. I'm still not sure the answer you code is producing is correct, I think your line volume_prices[_prices] += state.Volume / _prices.size is not being applied to every record in volume_prices, but here it is with benchmarking. About a 9x improvement.

def vpOriginal():
    Step = 0.00001
    Precision = 5
    _frame = getData()
    _low = 85.0
    _high = 116.4
    # Set the complete index of prices +/- 1 step due to weird floating point precision issues
    volume_prices = pd.Series(0, index=np.around(np.arange(_low - Step, _high + Step, Step), decimals=Precision))
    time_prices = volume_prices.copy()
    time_prices2 = volume_prices.copy()
    for index, state in _frame.iterrows():
        _prices = np.around(np.arange(state.Low, state.High, Step), decimals=Precision)

        # Evenly distribute the bar's volume over its range
        volume_prices[_prices] += state.Volume / _prices.size
        # Increment time at price
        time_prices[_prices] += 1
    time_prices2 += 1
    # Pandas only returns the 1st row of the max value,
    # so we need to reverse the series to find the other side
    # and then find the average price between those two extremes
#    print(volume_prices.head(10))
    volume_poc = (volume_prices.idxmax() + volume_prices.iloc[::-1].idxmax() / 2)
    time_poc = (time_prices.idxmax() + time_prices.iloc[::-1].idxmax() / 2)
    return volume_poc, time_poc

def vpNoDF():
    Step = 0.00001
    Precision = 5
    _frame = getData()
    _low = 85.0
    _high = 116.4
    # Set the complete index of prices +/- 1 step due to weird floating point precision issues
    volume_prices = pd.Series(0, index=np.around(np.arange(_low - Step, _high + Step, Step), decimals=Precision))
    time_prices = volume_prices.copy()
    for index, state in _frame.iterrows():
        _prices = np.around((state.High - state.Low) / Step , 0)

        # Evenly distribute the bar's volume over its range
        volume_prices.loc[state.Low:state.High] += state.Volume / _prices
        # Increment time at price
        time_prices.loc[state.Low:state.High] += 1

    # Pandas only returns the 1st row of the max value,
    # so we need to reverse the series to find the other side
    # and then find the average price between those two extremes
    volume_poc = (volume_prices.idxmax() + volume_prices.iloc[::-1].idxmax() / 2)
    time_poc = (time_prices.idxmax() + time_prices.iloc[::-1].idxmax() / 2)
    return volume_poc, time_poc

getData()
Out[8]: 
         Date    Open    High     Low   Close    Volume  Adj Close
0  2008-10-14  116.26  116.40  103.14  104.08  70749800     104.08
1  2008-10-13  104.55  110.53  101.02  110.26  54967000     110.26
2  2008-10-10   85.70  100.00   85.00   96.80  79260700      96.80
3  2008-10-09   93.35   95.80   86.60   88.74  57763700      88.74
4  2008-10-08   85.91   96.33   85.68   89.79  78847900      89.79
5  2008-10-07  100.48  101.50   88.95   89.16  67099000      89.16
6  2008-10-06   91.96   98.78   87.54   98.14  75264900      98.14
7  2008-10-03  104.00  106.50   94.65   97.07  81942800      97.07
8  2008-10-02  108.01  108.79  100.00  100.10  57477300     100.10
9  2008-10-01  111.92  112.36  107.39  109.12  46303000     109.12

vpOriginal()
Out[9]: (142.55000000000001, 142.55000000000001)

vpNoDF()
Out[10]: (142.55000000000001, 142.55000000000001)

%timeit vpOriginal()
2.79 s ± 24.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit vpNoDF()
300 ms ± 8.04 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)



回答3:


I've managed to get it down to 2 mins instead of 15 - at on daily timeframes anyway. It's still slow on lower timeframes (10 minutes on Hourly over a 2 year period with a precision of 2 for equities). Working with DataFrames as opposed to Series was FAR slower. I'm hoping for more but I don't know what I can do aside from the following solution:

# Upon class instantiation, I've created attributes for each timeframe
# related to `volume_at_price` and `time_at_price`. They serve as memory
# in between frame calculations
def _prices_at(self, frame, bars=0):
    # Include 1 step above high as np.arange does not
    # include the upper limit by default
    state = frame.iloc[-min(bars + 1, frame.index.size)]
    bins = np.around(np.arange(state.low, state.high + self.Step, self.Step), decimals=self.Precision)
    return pd.Series(state.volume / bins.size, index=bins)


# SetFeature/Feature implement timeframed attributes (i.e., 'volume_at_price_D')
_v = 'volume_at_price'
_t = 'time_at_price'

# Add to x_at_price histogram
_p = self._prices_at(frame)
self.SetFeature(_v, self.Feature(_v).add(_p, fill_value=0))
self.SetFeature(_t, self.Feature(_t).add(_p * 0 + 1, fill_value=0))

# Remove old data from histogram
_p = self._prices_at(frame, self.Bars)
v = self.SetFeature(_v, self.Feature(_v).subtract(_p, fill_value=0))
t = self.SetFeature(_t, self.Feature(_t).subtract(_p * 0 + 1, fill_value=0))

self.SetFeature('volume_poc', (v.idxmax() + v.iloc[::-1].idxmax()) / 2)
self.SetFeature('time_poc', (t.idxmax() + t.iloc[::-1].idxmax()) / 2)


来源:https://stackoverflow.com/questions/60578058/efficiently-calculating-point-of-control-with-pandas

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!