Question
Suppose I have the following problem:
import pandas as pd
import numpy as np
xp = [0.0, 0.5, 1.0]
np.random.seed(100)
df = pd.DataFrame(np.random.rand(10, 4), columns=['x0', 'y1', 'y2', 'y3'])
df
       x0      y1      y2      y3
0  0.5434  0.2784  0.4245  0.8448
1  0.0047  0.1216  0.6707  0.8259
2  0.1367  0.5751  0.8913  0.2092
3  0.1853  0.1084  0.2197  0.9786
4  0.8117  0.1719  0.8162  0.2741
5  0.4317  0.9400  0.8176  0.3361
6  0.1754  0.3728  0.0057  0.2524
7  0.7957  0.0153  0.5988  0.6038
8  0.1051  0.3819  0.0365  0.8904
9  0.9809  0.0599  0.8905  0.5769
I would like to compute a new column named interp by interpolation. The x-coordinate at which to interpolate is in column x0, the x-coordinates of the data points are given by xp, and the y-coordinates of the data points are in y1, y2 and y3.
So far, I came up with the following:
df['interp'] = df.apply(lambda x: np.interp(x.x0, xp, [x.y1, x.y2, x.y3]), axis=1)
df
       x0      y1      y2      y3  interp
0  0.5434  0.2784  0.4245  0.8448  0.4610
1  0.0047  0.1216  0.6707  0.8259  0.1268
2  0.1367  0.5751  0.8913  0.2092  0.6616
3  0.1853  0.1084  0.2197  0.9786  0.1496
4  0.8117  0.1719  0.8162  0.2741  0.4783
5  0.4317  0.9400  0.8176  0.3361  0.8344
6  0.1754  0.3728  0.0057  0.2524  0.2440
7  0.7957  0.0153  0.5988  0.6038  0.6018
8  0.1051  0.3819  0.0365  0.8904  0.3093
9  0.9809  0.0599  0.8905  0.5769  0.5889
However, the dataframe on which this calculation will be performed contains more than a million rows, so I'd prefer a faster method than apply. Any ideas?
np.interp only seems to take 1-D arrays, which is why I went with apply.
Answer 1:
One good solution for making this faster is pandas.DataFrame.eval():
TL;DR
Seconds per number of rows
Rows:     100   1000  10000    1E5    1E6    1E7
apply:  0.076  0.734  7.812
eval:   0.056  0.053  0.058  0.087  0.338  2.887
As these timings show, eval() has a fair amount of setup overhead: up to 10,000 rows it takes essentially the same time regardless of size. But at 10,000 rows it is already two orders of magnitude faster than apply, so it is certainly worth the overhead for large data sets.
What is it?
From the docs:
pandas.eval(expr, parser='pandas', engine=None, truediv=True,
local_dict=None, global_dict=None, resolvers=(),
level=0, target=None, inplace=None)
Evaluate a Python expression as a string using various backends.
The following arithmetic operations are supported: +, -, *, /, ** , %, // (python engine only) along with the following boolean operations: | (or), & (and), and ~ (not). Additionally, the 'pandas' parser allows the use of and, or, and not with the same semantics as the corresponding bitwise operators. Series and DataFrame objects are supported and behave as they would with plain ol’ Python evaluation.
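As a small illustration of the description above, here is a minimal sketch of DataFrame.eval() on a toy frame. The column names a and b and the values are made up for this example; they are not from the question.

```python
import pandas as pd

# Toy data, purely illustrative (not from the question).
df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})

# Column names are resolved inside the expression string.
total = df.eval("a + b * 2")

# A comparison yields a boolean Series; multiplying by it acts as a 0/1 mask.
masked = df.eval("b * (a >= 2)")

print(total.tolist())
print(masked.tolist())
```

The second expression uses the same masking idea that the answer relies on: a value multiplied by a boolean test is kept where the test is true and zeroed where it is false.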
Tricks performed for this Question:
The code below exploits the fact that, with three breakpoints, the interpolation always falls in one of two segments. It computes the interpolant for both segments and then discards the unused one by multiplying by a boolean test (i.e., 0 or 1).
The actual expression passed to eval is:
((y2-y1) / 0.5 * (x0-0.0) + y1) * (x0 < 0.5)+((y3-y2) / 0.5 * (x0-0.5) + y2) * (x0 >= 0.5)
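To see that this expression reproduces np.interp, the same formula can be written with plain NumPy arrays and compared against a row-by-row reference. This is only a sketch for checking the arithmetic; the random data and the variable names seg1, seg2, two_segment and reference are assumptions made for this example.

```python
import numpy as np

# Illustrative random inputs in [0, 1), one value per "row".
rng = np.random.default_rng(0)
x0, y1, y2, y3 = rng.random((4, 1000))

# Same two-segment formula as the eval string, written with NumPy arrays.
seg1 = ((y2 - y1) / 0.5 * (x0 - 0.0) + y1) * (x0 < 0.5)
seg2 = ((y3 - y2) / 0.5 * (x0 - 0.5) + y2) * (x0 >= 0.5)
two_segment = seg1 + seg2

# Reference: one np.interp call per row, as in the question.
xp = [0.0, 0.5, 1.0]
reference = np.array(
    [np.interp(x, xp, [a, b, c]) for x, a, b, c in zip(x0, y1, y2, y3)]
)

print(np.allclose(two_segment, reference))
```

One caveat worth keeping in mind: the two agree only while x0 stays within [0.0, 1.0]. Outside that range np.interp by default returns the first or last y value, whereas the formula keeps extending the outer segment linearly.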
Code:
import pandas as pd
import numpy as np

xp = [0.0, 0.5, 1.0]
np.random.seed(100)

def method1():
    # Row-by-row np.interp via apply (the approach from the question)
    df['interp'] = df.apply(
        lambda x: np.interp(x.x0, xp, [x.y1, x.y2, x.y3]), axis=1)

def method2():
    # One expression per segment; the trailing comparison masks it to 0 or 1
    exp = '((y%d-y%d) / %s * (x0-%s) + y%d) * (x0 %s 0.5)'
    exp_1 = exp % (2, 1, xp[1] - xp[0], xp[0], 1, '<')
    exp_2 = exp % (3, 2, xp[2] - xp[1], xp[1], 2, '>=')
    df['interp2'] = df.eval(exp_1 + '+' + exp_2)

from timeit import timeit

def runit(stmt):
    # Print the total time for 10 calls of the named function
    print("%s: %.3f" % (
        stmt, timeit(stmt + '()', number=10,
                     setup='from __main__ import ' + stmt)))

def runit_size(size):
    global df
    df = pd.DataFrame(
        np.random.rand(size, 4), columns=['x0', 'y1', 'y2', 'y3'])
    print('Rows: %d' % size)
    if size <= 10000:
        # Time apply only on the smaller frames
        runit('method1')
    runit('method2')

for i in (100, 1000, 10000, 100000, 1000000, 10000000):
    runit_size(i)

print(df.head())
Results:
         x0        y1        y2        y3    interp   interp2
0  0.060670  0.949837  0.608659  0.672003  0.908439  0.908439
1  0.462774  0.704273  0.181067  0.647582  0.220021  0.220021
2  0.568109  0.954138  0.796690  0.585310  0.767897  0.767897
3  0.455355  0.738452  0.812236  0.927291  0.805648  0.805648
4  0.826376  0.029957  0.772803  0.521777  0.608946  0.608946
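The eval expression above is tied to exactly three breakpoints. If more breakpoints are needed, one possible alternative (not part of the original answer) is to vectorize the segment lookup with np.searchsorted. The sketch below assumes xp is strictly increasing; the helper name interp_rows is hypothetical and chosen for this example.

```python
import numpy as np
import pandas as pd

def interp_rows(x, xp, fp):
    """Row-wise linear interpolation: x has shape (n,), xp (k,), fp (n, k).

    Hypothetical helper; assumes xp is strictly increasing.
    """
    x = np.asarray(x, dtype=float)
    xp = np.asarray(xp, dtype=float)
    fp = np.asarray(fp, dtype=float)
    # Index of the left breakpoint of the segment containing each x.
    idx = np.clip(np.searchsorted(xp, x, side="right") - 1, 0, len(xp) - 2)
    rows = np.arange(len(x))
    x_lo, x_hi = xp[idx], xp[idx + 1]
    y_lo, y_hi = fp[rows, idx], fp[rows, idx + 1]
    # Clamp the fraction so out-of-range x returns the end values, like np.interp.
    t = np.clip((x - x_lo) / (x_hi - x_lo), 0.0, 1.0)
    return y_lo + t * (y_hi - y_lo)

xp = [0.0, 0.5, 1.0]
np.random.seed(100)
df = pd.DataFrame(np.random.rand(10, 4), columns=["x0", "y1", "y2", "y3"])

# Vectorized result versus the apply-based approach from the question.
fast = interp_rows(df["x0"].to_numpy(), xp, df[["y1", "y2", "y3"]].to_numpy())
slow = df.apply(lambda r: np.interp(r.x0, xp, [r.y1, r.y2, r.y3]), axis=1)

print(np.allclose(fast, slow))
```

This has not been timed against the eval version, so it should be read as a way to generalize the idea rather than as a claim about speed.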
Source: https://stackoverflow.com/questions/43765796/interpolating-values-from-a-dataframe-based-on-a-column-value