How to achieve GPU parallelism using TensorFlow?


Question


I am writing a GPU-based string matching program using TensorFlow's edit distance feature. Once I know which portion matches, I extract the corresponding details and store them in a DataFrame, which is eventually saved as a CSV file. Here are the details:

  • I have 2 lists. The smaller list, called test_string, contains about 9 words. The larger one, called ref_string, comes from splitting a large text file into one word per line. The file was originally a set of key-value pairs, so after splitting, each key sits on one line and its value on the next.

  • I am using multiprocessing/joblib to read the files in parallel and pass each file's contents as the ref_string list, on which the edit distance comparison is done on the GPU.

  • There is a total of 4080 text files, and each file contains about 10,000 words when split.

  • Using tf.edit_distance, each test word is matched against the ref_string words. The index where the edit distance becomes zero is noted, and then (index + 1) is used to extract the corresponding value (see the sketch after this list).

  • System spec: Intel Core i5, 12 GB RAM, NVIDIA 940MX with 2 GB VRAM, TensorFlow 1.10.0, CUDA 9.0, cuDNN 7.1.
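
To illustrate the key/value extraction step, here is a minimal sketch. The sample data is made up; in the real program the matching index comes from tf.edit_distance rather than list.index:

# Minimal sketch of the key/value lookup, assuming ref_string alternates
# key lines and value lines as described above (sample data is made up).
ref_string = ["name", "Apple iPhone", "Price", "999", "brand", "Apple"]

key = "Price"
idx = ref_string.index(key)    # in the real code: the index where edit distance == 0
value = ref_string[idx + 1]    # the value is on the line after its key
print(value)                   # -> 999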

I have already done a similar program using the CPU, which can be found here, and I wanted to see whether using the GPU can speed up the execution times.

Here is a small code snippet:

import os
import pandas as pd
import tensorflow as tf
from joblib import Parallel, delayed

# path, slicer, create_sparse_vec, filler, and get_index are defined
# elsewhere in the full program.

def main_prog(filenames):
    try:
        # Read one file and split it into one word per line.
        with open(path + filenames, 'r') as f:
            ref_string = f.readlines()
        ref_string = [x.strip() for x in ref_string]
        index = slicer(ref_string)
        ref_string = ref_string[index[0]:(index[1] - 1)]

        for test_string1 in test_string:
            # Check for exact matches first.
            out = [x == test_string1 for x in ref_string]
            out = [i for i, x in enumerate(out) if x]
            if len(out) != 0:
                # Compare the strings with tf.edit_distance on the GPU.
                with tf.Session(config=tf.ConfigProto(intra_op_parallelism_threads=10)) as sess:
                    test_string_sparse = create_sparse_vec(filler(test_string1, ref_string))
                    ref_string_sparse = create_sparse_vec(ref_string)
                    out = get_index(sess.run(tf.edit_distance(test_string_sparse, ref_string_sparse, normalize=True)))
                    # The value is on the line after its key, hence index + 1.
                    df.set_value(0, test_string1, ref_string[out + 1])
            else:
                df.set_value(0, test_string1, "nil")
        return df
    except Exception:
        return df

if __name__ == '__main__':
    test_string = ["name", "Price", "oPrice", "discount", "brand", "id", "seller", "id", "category"]
    df = pd.DataFrame(columns=test_string)
    filenames = os.listdir("/home/Desktop/Parallelise/mod_all_page/")
    data = df.append(Parallel(n_jobs=2)(delayed(main_prog)(filenames[i]) for i in range(100)),
                     ignore_index=True)
    data.to_csv("/home/Desktop/final_out.csv")
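
For reference, a plausible sketch of create_sparse_vec, assuming it builds the character-level sparse tensors that tf.edit_distance expects. This is my reconstruction for context, not the exact helper from the program:

# Hypothetical reconstruction of create_sparse_vec: one 3-D sparse entry
# per character, so tf.edit_distance can compare word lists element-wise.
def create_sparse_vec(word_list):
    num_words = len(word_list)
    indices = [[xi, 0, yi] for xi, word in enumerate(word_list)
                           for yi, _ in enumerate(word)]
    chars = list(''.join(word_list))
    max_len = max(len(word) for word in word_list)
    return tf.SparseTensorValue(indices, chars, [num_words, 1, max_len])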

The code is working, but it's very slow. I can see the average CPU utilization around 80-90%. While checking the nvidia-smi status there were 2 jobs running, and one was consuming close to 1.9 GB. After some time the program crashes due to a memory failure. While testing with around 100 input files I get execution times of around 70 sec, while the CPU version handles all 4080 files in under 18 sec.

  • GPU version (tensorflow-gpu), 100 input files: 70 sec.
  • CPU version (multiprocessing), 4080 input files: 18 sec.

Is there something wrong with the code? Can I make it faster? I have tried Google Colab to access a Tesla GPU, since it has more memory, but the performance is still the same. The code is not optimized somewhere; I will try profiling and post an update.
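
For the profiling, I plan to start with something like this minimal cProfile sketch (assuming a single-file run of main_prog):

# Minimal profiling sketch: run main_prog on one file under cProfile and
# print the 10 most expensive calls by cumulative time.
import cProfile
import pstats

cProfile.run('main_prog(filenames[0])', 'profile.out')
pstats.Stats('profile.out').sort_stats('cumulative').print_stats(10)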

If somebody can point out where I made a mistake, it would be really helpful. Thanks!

Update:

I was able to bring the execution time for the 100 files down from 70 sec to 8 sec by increasing n_jobs to 4. But this gives a "CUDA out of memory" error when trying the same on a large dataset like the full 4080 files.
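
For reference, per-process GPU memory can be capped in TensorFlow 1.x; here is a sketch of a config that might let several workers share the 2 GB card (an untested assumption, the 0.2 fraction is illustrative):

# Sketch: cap each process at ~20% of GPU memory and let allocations grow
# on demand, so several joblib workers can share the 2 GB card (untested).
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.2,
                            allow_growth=True)
config = tf.ConfigProto(gpu_options=gpu_options,
                        intra_op_parallelism_threads=10)
with tf.Session(config=config) as sess:
    # ... run the edit distance comparison as before ...
    pass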

Source: https://stackoverflow.com/questions/54402154/how-to-achieve-gpu-parallelism-using-tensor-flow
