Parallelising gradient calculation in Julia

冷暖自知 submitted on 2019-12-04 04:19:37

I would say that GD is not a good candidate for parallelization with any of the proposed methods: SharedArray, DistributedArray, or a hand-rolled distribution of chunks of data.

The problem does not lie in Julia, but in the GD algorithm itself. Consider the code:

Main process:

for iter = 1:iterations #iterations: "the more the better"
    δ = _gradient_descent_shared(X, y, θ) 
    θ = θ -  α * (δ/N)   
end

The problem is the above for-loop, which is unavoidable: no matter how good _gradient_descent_shared is, the total number of sequential iterations kills the noble concept of parallelization.
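The dependency is easy to see in a plain serial version (a minimal sketch of my own, not code from the question): each iteration consumes the θ produced by the previous one, so only the work *inside* a single iteration can be parallelized.

```julia
# Serial full-batch gradient descent for linear least squares.
# Each iteration reads the θ written by the previous iteration,
# so the iterations themselves form a strictly sequential chain.
function gradient_descent_serial(X, y, θ, α, iterations)
    N = length(y)
    for iter = 1:iterations
        δ = X' * (X * θ .- y)      # full-batch gradient
        θ = θ .- α .* (δ ./ N)     # depends on previous θ
    end
    θ
end
```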

After reading the question and the above suggestion, I started implementing GD using SharedArray. Please note that I'm not an expert in the field of SharedArrays.

The main process parts (simple implementation without regularization):

run_gradient_descent(X::SharedArray, y::SharedArray, θ::SharedArray, α, iterations) = begin
  N = length(y)

  for iter = 1:iterations 
    δ = _gradient_descent_shared(X, y, θ) 
    θ = θ -  α * (δ/N)   
  end     
  θ
end

_gradient_descent_shared(X::SharedArray, y::SharedArray, θ::SharedArray, op=(+)) = begin            
    if size(X,1) <= length(procs(X))
        return _gradient_descent_serial(X, y, θ)
    else
        rrefs = map(p -> (@spawnat p _gradient_descent_serial(X, y, θ)), procs(X))
        return mapreduce(r -> fetch(r), op, rrefs)
    end 
end

The code common to all workers:

#= Returns the range of indices of a chunk for every worker on which it can work. 
The function splits data examples (N rows into chunks), 
not the parts of the particular example (features dimensionality remains intact).=# 
# (On Julia ≥ 0.7 this also needs `using Distributed` on the driver and
# `@everywhere using SharedArrays` on all processes.)
@everywhere function _worker_range(S::SharedArray)
    idx = indexpids(S)
    if idx == 0                     # the calling process holds no chunk
        return 1:size(S,1), 1:size(S,2)
    end
    nchunks = length(procs(S))
    # `range(...; stop, length)` replaces the deprecated `linspace`
    splits = [round(Int, s) for s in range(0, stop=size(S,1), length=nchunks+1)]
    splits[idx]+1:splits[idx+1], 1:size(S,2)
end
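The chunk arithmetic can be checked in isolation. Below is a hypothetical standalone helper, chunk_bounds (my own name, not in the answer), that mirrors the splits computation above:

```julia
# Divide N rows into `nchunks` nearly equal, contiguous row ranges,
# exactly as the `splits` computation in `_worker_range` does.
function chunk_bounds(N, nchunks)
    splits = [round(Int, s) for s in range(0, stop=N, length=nchunks+1)]
    [splits[i]+1:splits[i+1] for i in 1:nchunks]
end
```

The ranges are contiguous and cover 1:N with no overlap, so every example lands on exactly one worker.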

# Computation on this worker's chunk of the data.
@everywhere _gradient_descent_serial(X::SharedArray, y::SharedArray, θ::SharedArray) = begin
    prange = _worker_range(X)

    pX = sdata(X[prange[1], prange[2]])
    py = sdata(y[prange[1], :])

    pX' * (pX * sdata(θ) .- py)     # partial gradient for this chunk
end
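Reducing the per-worker results with (+) is valid because the full-batch gradient Xᵀ(Xθ − y) is a sum over rows, so chunk gradients add up to the full gradient. A quick self-contained check (my own illustration):

```julia
# The gradient Xᵀ(Xθ − y) decomposes over row chunks: computing it
# per chunk and summing reproduces the full-batch gradient.
X = [1.0 2.0; 3.0 4.0; 5.0 6.0; 7.0 8.0]
y = [1.0, 2.0, 3.0, 4.0]
θ = [0.5, -0.5]

full = X' * (X * θ .- y)
partial = sum(X[r, :]' * (X[r, :] * θ .- y[r]) for r in (1:2, 3:4))
```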

The data loading and training. Let me assume that we have:

  • features in X::Array of size (N,D), where N is the number of examples and D the dimensionality of the features
  • labels in y::Array of size (N,1)

The main code might look like this:

X=[ones(size(X,1)) X] #adding the artificial coordinate 
N, D = size(X)
MAXITER = 500
α = 0.01

using SharedArrays   # stdlib module, required on Julia ≥ 0.7

initialθ = SharedArray{Float64}(D, 1)   # modern constructor syntax
sX = convert(SharedArray, X)
sy = convert(SharedArray, y)
X = nothing
y = nothing
GC.gc()   # `gc()` moved into the GC module in Julia 1.0

finalθ = run_gradient_descent(sX, sy, initialθ, α, MAXITER);

After implementing this and running it (on the 8 cores of my Intel Core i7), I got only a very slight acceleration over serial GD (1 core) on my multiclass (19 classes) training data: 715 s for serial GD vs. 665 s for shared GD.

If my implementation is correct (please check this out - I'm counting on that), then parallelizing the GD algorithm is not worth the effort. You would very likely get better acceleration using stochastic GD on a single core.
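For completeness, here is a minimal stochastic-GD sketch (my own illustration, not code from the question): one update per randomly chosen example, with a fresh pass order each epoch.

```julia
using Random

# Minimal stochastic gradient descent: update θ from a single example
# at a time, visiting the examples in a random order each epoch.
function sgd(X, y, θ, α, epochs; rng=MersenneTwister(0))
    N = size(X, 1)
    for _ in 1:epochs
        for i in randperm(rng, N)
            xi = X[i, :]
            δ = xi .* (xi' * θ - y[i])   # gradient of one example's loss
            θ = θ .- α .* δ
        end
    end
    θ
end
```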

If you want to reduce the amount of data movement, you should strongly consider using SharedArrays. You could preallocate just one output vector, and pass it as an argument to each worker. Each worker sets a chunk of it, just as you suggested.
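A sketch of that idea (all names here are hypothetical, not from the question): preallocate one output matrix with a column per worker, let each worker write its partial gradient into its own column, and have the driver sum the columns instead of fetching remote results.

```julia
using SharedArrays

# Hypothetical sketch: each worker writes its chunk's partial gradient
# into its own column of a preallocated shared output matrix `out`,
# so no result objects need to be fetched back to the driver.
function gradient_into!(out::AbstractMatrix, X, y, θ, col::Int, rows::UnitRange{Int})
    pX = X[rows, :]
    out[:, col] = pX' * (pX * θ .- y[rows])   # this chunk's partial gradient
    nothing
end
```

The driver then computes δ = vec(sum(out, dims=2)) once per iteration, replacing the fetch-and-reduce step.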
