How to achieve GPU parallelism using TensorFlow?


Question


I am writing a GPU-based string matching program using TensorFlow's edit distance feature. Once I know which portion matches, I extract the corresponding details and store them in a DataFrame, which is eventually saved as a CSV file. Here are the details:

  • I have 2 lists. The smaller list, called test_string, contains about 9 words. The larger one, called ref_string, comes from splitting a large text file into one word per line. The file was originally a set of key-value pairs, so after splitting, each key sits on one line and its value on the next.

  • I am using multiprocessing/joblib to read the files in parallel and pass each file's contents as the ref_string list, on which the edit distance comparison is done on the GPU.

  • There is a total of 4080 text files, and each file contains about 10,000 words when split.

  • Using tf.edit_distance, each test word is matched against the ref_string words. The index where the edit distance becomes zero is noted, and then (index + 1) is used to extract the corresponding value (see the sketch after this list).

  • System spec: Intel Core i5, 12 GB RAM, NVIDIA 940MX with 2 GB VRAM, TensorFlow 1.10.0, CUDA 9.0, cuDNN 7.1.
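
To illustrate the key/value extraction step, here is a minimal sketch. The sample data is made up; in the real program the matching index comes from tf.edit_distance rather than list.index:

# Minimal sketch of the key/value lookup, assuming ref_string alternates
# key lines and value lines as described above (sample data is made up).
ref_string = ["name", "Apple iPhone", "Price", "999", "brand", "Apple"]

key = "Price"
idx = ref_string.index(key)    # in the real code: the index where edit distance == 0
value = ref_string[idx + 1]    # the value is on the line after its key
print(value)                   # -> 999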

I have already done a similar program using the CPU, which can be found here, and I wanted to see whether using the GPU can speed up the execution times.

Here is a small code snippet:

import os
import pandas as pd
import tensorflow as tf
from joblib import Parallel, delayed

# path, slicer, create_sparse_vec, filler, and get_index are defined
# elsewhere in the full program.

def main_prog(filenames):
    try:
        # Read one file and split it into one word per line.
        with open(path + filenames, 'r') as f:
            ref_string = f.readlines()
        ref_string = [x.strip() for x in ref_string]
        index = slicer(ref_string)
        ref_string = ref_string[index[0]:(index[1] - 1)]

        for test_string1 in test_string:
            # Check for exact matches first.
            out = [x == test_string1 for x in ref_string]
            out = [i for i, x in enumerate(out) if x]
            if len(out) != 0:
                # Compare the strings with tf.edit_distance on the GPU.
                with tf.Session(config=tf.ConfigProto(intra_op_parallelism_threads=10)) as sess:
                    test_string_sparse = create_sparse_vec(filler(test_string1, ref_string))
                    ref_string_sparse = create_sparse_vec(ref_string)
                    out = get_index(sess.run(tf.edit_distance(test_string_sparse, ref_string_sparse, normalize=True)))
                    # The value is on the line after its key, hence index + 1.
                    df.set_value(0, test_string1, ref_string[out + 1])
            else:
                df.set_value(0, test_string1, "nil")
        return df
    except Exception:
        return df

if __name__ == '__main__':
    test_string = ["name", "Price", "oPrice", "discount", "brand", "id", "seller", "id", "category"]
    df = pd.DataFrame(columns=test_string)
    filenames = os.listdir("/home/Desktop/Parallelise/mod_all_page/")
    data = df.append(Parallel(n_jobs=2)(delayed(main_prog)(filenames[i]) for i in range(100)),
                     ignore_index=True)
    data.to_csv("/home/Desktop/final_out.csv")
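
For reference, a plausible sketch of create_sparse_vec, assuming it builds the character-level sparse tensors that tf.edit_distance expects. This is my reconstruction for context, not the exact helper from the program:

# Hypothetical reconstruction of create_sparse_vec: one 3-D sparse entry
# per character, so tf.edit_distance can compare word lists element-wise.
def create_sparse_vec(word_list):
    num_words = len(word_list)
    indices = [[xi, 0, yi] for xi, word in enumerate(word_list)
                           for yi, _ in enumerate(word)]
    chars = list(''.join(word_list))
    max_len = max(len(word) for word in word_list)
    return tf.SparseTensorValue(indices, chars, [num_words, 1, max_len])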

The code is working, but it's very slow. I can see the average CPU utilization around 80-90%. While checking the nvidia-smi status there were 2 jobs running, and one was consuming close to 1.9 GB. After some time the program crashes due to a memory failure. While testing with around 100 input files I get execution times of around 70 sec, while the CPU version handles all 4080 files in under 18 sec.

  • GPU version (tensorflow-gpu), 100 input files: 70 sec.
  • CPU version (multiprocessing), 4080 input files: 18 sec.

Is there something wrong with the code? Can I make it faster? I have tried Google Colab to access a Tesla GPU, since it has more memory, but the performance is still the same. The code is not optimized somewhere; I will try profiling and post an update.
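
For the profiling, I plan to start with something like this minimal cProfile sketch (assuming a single-file run of main_prog):

# Minimal profiling sketch: run main_prog on one file under cProfile and
# print the 10 most expensive calls by cumulative time.
import cProfile
import pstats

cProfile.run('main_prog(filenames[0])', 'profile.out')
pstats.Stats('profile.out').sort_stats('cumulative').print_stats(10)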

If somebody can point out where I made a mistake, it would be really helpful. Thanks!

Update:

I was able to bring the execution time for the 100 files down from 70 sec to 8 sec by increasing n_jobs to 4. But this gives a "CUDA out of memory" error when trying the same on a large dataset like the full 4080 files.
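
For reference, per-process GPU memory can be capped in TensorFlow 1.x; here is a sketch of a config that might let several workers share the 2 GB card (an untested assumption, the 0.2 fraction is illustrative):

# Sketch: cap each process at ~20% of GPU memory and let allocations grow
# on demand, so several joblib workers can share the 2 GB card (untested).
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.2,
                            allow_growth=True)
config = tf.ConfigProto(gpu_options=gpu_options,
                        intra_op_parallelism_threads=10)
with tf.Session(config=config) as sess:
    # ... run the edit distance comparison as before ...
    pass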

Source: https://stackoverflow.com/questions/54402154/how-to-achieve-gpu-parallelism-using-tensor-flow
