Optimize this function with numpy (or other vectorization methods)

Posted by 独自空忆成欢 on 2019-12-10 18:37:46

Question


I am using Python to compute a classic quantity in population genetics. I am well aware that many existing algorithms do the job, but I wanted to build my own.

The paragraph below is a picture because MathJax is not supported on StackOverflow.

I would like an efficient algorithm to calculate these Fst values. For the moment I have only managed to write for loops, and none of the calculations are vectorized. How can I do this calculation using numpy (or other vectorization methods)?


Here is code that I think should do the job:

def Fst(W, p):
    I = len(p[0])  # number of columns of p (index i)
    K = len(p)     # number of rows of p (index k)
    H_T = 0
    H_S = 0
    for i in range(I):  # range instead of xrange so this also runs on Python 3
        bar_p_i = 0
        for k in range(K):
            bar_p_i += W[k] * p[k][i]        # weighted mean of column i
            H_S += W[k] * p[k][i] * p[k][i]  # weighted sum of squared entries
        H_T += bar_p_i * bar_p_i
    H_T = 1 - H_T
    H_S = 1 - H_S
    return (H_T - H_S) / H_T

def main():
    W = [0.2, 0.1, 0.2, 0.5]
    p = [[0.1, 0.3, 0.6], [0, 0, 1], [0.4, 0.5, 0.1], [0, 0.1, 0.9]]
    F = Fst(W, p)
    print("Fst = " + str(F))

main()
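
Read directly off the loops above, the function computes the following quantities. This is a transcription of the code, not of the original picture; the index k runs over the rows of p (and the entries of W) and i runs over the columns of p.

```latex
\begin{align*}
\bar{p}_i &= \sum_{k=1}^{K} W_k \, p_{k,i} \\
H_T &= 1 - \sum_{i=1}^{I} \bar{p}_i^{\,2} \\
H_S &= 1 - \sum_{k=1}^{K} W_k \sum_{i=1}^{I} p_{k,i}^{2} \\
F_{ST} &= \frac{H_T - H_S}{H_T}
\end{align*}
```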

Answer 1:


There's no reason to use loops here. And you really shouldn't need Numba or Cython for this stuff: linear algebra expressions like the one you have are the whole reason behind vectorized operations in NumPy.

Since this type of problem is going to pop up again and again if you keep using NumPy, I would recommend getting a basic handle on linear algebra in NumPy. You might find this book chapter helpful:

https://www.safaribooksonline.com/library/view/python-for-data/9781449323592/ch04.html

As for your specific situation: start by creating numpy arrays from your variables:

import numpy as np
W = np.array(W)
p = np.array(p)

Now, your \bar p_i are defined by a dot product. That's easy:

bar_p_i = p.T.dot(W)

Note the T, for the transpose: the dot product sums over the last index of the first array and the first index of the second array. p has shape (K, I), so the transpose swaps its two indices, k becomes the last index, and the sum runs over k, leaving one value per i.
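
To make the index bookkeeping concrete, here is a small sketch using the sample data from the question. The names bar_p and bar_p_loop are only illustrative; the point is that p.T.dot(W) gives the same numbers as an explicit loop over k.

```python
import numpy as np

# Sample data from the question: W has one entry per row of p.
W = np.array([0.2, 0.1, 0.2, 0.5])
p = np.array([[0.1, 0.3, 0.6],
              [0.0, 0.0, 1.0],
              [0.4, 0.5, 0.1],
              [0.0, 0.1, 0.9]])

# p has shape (K, I) = (4, 3) and W has shape (K,) = (4,).
# p.T has shape (3, 4), so p.T.dot(W) sums over k and leaves one value per i.
bar_p = p.T.dot(W)

# The same quantity written as an explicit loop, for comparison.
bar_p_loop = np.array([sum(W[k] * p[k, i] for k in range(p.shape[0]))
                       for i in range(p.shape[1])])

print(bar_p.shape)                     # (3,)
print(np.allclose(bar_p, bar_p_loop))  # True
print(np.allclose(bar_p, W.dot(p)))    # True
```

W.dot(p) gives the same vector without the transpose, because for a 1-D W the sum runs over its only index and the first index of p.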

Your H_T is defined by a sum of the squared \bar p_i. That's also easy:

H_T = 1 - (bar_p_i**2).sum()

Similarly for your H_S, except that here you square the entries of p before taking the weighted sum:

H_S = 1 - ((p**2).T.dot(W)).sum()
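
Putting the pieces together, here is a minimal sketch of the whole calculation, checked against the loop version from the question. The function names fst_vectorized and fst_loops are made up for this example, and the squares are taken exactly as in the question's loops (bar_p_i squared for H_T, the entries of p squared for H_S).

```python
import numpy as np

def fst_vectorized(W, p):
    """Vectorized version of the question's Fst (hypothetical helper name)."""
    W = np.asarray(W, dtype=float)
    p = np.asarray(p, dtype=float)
    bar_p = p.T.dot(W)                   # weighted mean of each column of p
    H_T = 1.0 - (bar_p ** 2).sum()       # 1 - sum_i bar_p_i^2
    H_S = 1.0 - (p ** 2).T.dot(W).sum()  # 1 - sum_k W_k sum_i p_ki^2
    return (H_T - H_S) / H_T

def fst_loops(W, p):
    """Loop version from the question, kept only as a reference for checking."""
    H_T = 0.0
    H_S = 0.0
    for i in range(len(p[0])):
        bar_p_i = 0.0
        for k in range(len(p)):
            bar_p_i += W[k] * p[k][i]
            H_S += W[k] * p[k][i] * p[k][i]
        H_T += bar_p_i * bar_p_i
    H_T = 1 - H_T
    H_S = 1 - H_S
    return (H_T - H_S) / H_T

W = [0.2, 0.1, 0.2, 0.5]
p = [[0.1, 0.3, 0.6], [0, 0, 1], [0.4, 0.5, 0.1], [0, 0.1, 0.9]]

print("Fst (vectorized) =", fst_vectorized(W, p))
print("Fst (loops)      =", fst_loops(W, p))
print("match:", np.isclose(fst_vectorized(W, p), fst_loops(W, p)))
```

For the sample data both versions give approximately 0.3316. The vectorized version assumes W has one entry per row of p; NumPy raises a shape error otherwise.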


Source: https://stackoverflow.com/questions/31168115/optimize-this-function-with-numpy-or-other-vectorization-methods