Question
I am writing a GPU-based string matching program using TensorFlow's edit distance features. Once the matching portion is known, I extract the details, store them in a DataFrame, and eventually save that as a CSV file. Here are the details:
I have 2 lists. The smaller list, called test_string, contains about 9 words. The larger one, called ref_string, comes from splitting a large text file into one word per line. The file originally held key-value pairs, so after splitting, each key is on one line and its value is on the next line.
I am using multiprocessing/joblib to read the files in parallel and pass each file's contents as the ref_string list, where the edit distance comparison is done on the GPU.
There are a total of 4080 text files, and each text file contains about 10,000 words when split.
Using tf.edit_distance, each word in test_string is matched against the words in ref_string. The index where the edit distance becomes zero is noted, and then (index+1) is used to extract the corresponding value.
System spec: Intel Core i5, 12 GB RAM, Nvidia 940MX with 2 GB, TensorFlow 1.10.0, CUDA 9.0, cuDNN 7.1.
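To illustrate the matching step, here is a simplified sketch (not my actual code; the real helpers create_sparse_vec, filler, slicer and get_index appear in the snippet below, and the words_to_sparse helper plus the sample ref_string here are made up just for illustration):

import tensorflow as tf

def words_to_sparse(word_list):
    # One word per row of a 3-D SparseTensor ([num_words, 1, max_word_len]),
    # character by character, which is the layout tf.edit_distance expects.
    indices = [[row, 0, col] for row, word in enumerate(word_list)
                             for col, _ in enumerate(word)]
    chars = list(''.join(word_list))
    max_len = max(len(w) for w in word_list)
    return tf.SparseTensor(indices, chars, [len(word_list), 1, max_len])

ref_string = ["name", "shoe", "Price", "29.99"]   # key on one line, value on the next
query = "name"

# Broadcast the query against every reference word and compute normalized edit distances
hypothesis = words_to_sparse([query] * len(ref_string))
truth = words_to_sparse(ref_string)
distances = tf.edit_distance(hypothesis, truth, normalize=True)

with tf.Session() as sess:
    d = sess.run(distances).ravel()
    idx = int(d.argmin())
    if d[idx] == 0.0:                 # edit distance zero -> exact match on the key
        value = ref_string[idx + 1]   # its value is on the next line, i.e. (index+1)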
I have done a similar program using the CPU, which can be found here, and I wanted to see if using the GPU can speed up the execution time.
Here is the small code snippet:
import os
import pandas as pd
import tensorflow as tf
from joblib import Parallel, delayed

# path, slicer, filler, create_sparse_vec and get_index are defined elsewhere

def main_prog(filenames):
    try:
        # Read one reference file and split it into a list of words
        with open(path + filenames, 'r') as f:
            ref_string = f.readlines()
        ref_string = [x.strip() for x in ref_string]
        index = slicer(ref_string)
        ref_string = ref_string[index[0]:(index[1] - 1)]
        for i in range(0, len(test_string)):
            test_string1 = test_string[i]
            # Cheap exact-match pre-check before going to the GPU
            out = [x == test_string1 for x in ref_string]
            out = [i for i, x in enumerate(out) if x]
            if len(out) != 0:
                # Comparing the data using tf with edit distance
                with tf.Session(config=tf.ConfigProto(intra_op_parallelism_threads=10)) as sess:
                    test_string_sparse = create_sparse_vec(filler(test_string1, ref_string))
                    ref_string_sparse = create_sparse_vec(ref_string)
                    out = get_index(sess.run(tf.edit_distance(test_string_sparse, ref_string_sparse, normalize=True)))
                    # The value sits on the line after the matched key
                    df.set_value(0, test_string1, ref_string[out + 1])
            else:
                df.set_value(0, test_string1, "nil")
        return df
    except:
        return df


if __name__ == '__main__':
    test_string = ["name", "Price", "oPrice", "discount", "brand", "id", "seller", "id", "category"]
    df = pd.DataFrame(columns=test_string)
    filenames = os.listdir("/home/Desktop/Parallelise/mod_all_page/")
    data = df.append((Parallel(n_jobs=2)(delayed(main_prog)(filenames[i]) for i in range(100))), ignore_index=True)
    data.to_csv("/home/Desktop/final_out.csv")
The code is working but it is very slow. I can see the CPU utilization averaging around 80-90%. While checking nvidia-smi, there were 2 jobs running and one was consuming close to 1.9 GB. After some time the program crashes due to memory failure. While testing with around 100 input files I am getting execution times of around 70 sec, whereas the CPU version does the extraction for all 4080 files in under 18 sec.
- GPU version (tensorflow-gpu): 100 input files in 70 sec.
- CPU version (multiprocessing): 4080 input files in 18 sec.
Is there something wrong with the code? Can I make it faster? I have tried Google Colab to access a Tesla GPU, since it has more RAM, but the performance is still the same. The code is not optimized somewhere. I will try profiling and post an update.
If somebody can point out where I made a mistake it would be really helpful. Thanks!
Update:
I was able to bring down the execution time for the 100 files from 70 sec to 8 sec by increasing n_jobs to 4. But this gives a "CUDA out of memory" error when trying the same on a larger dataset like the 4080 files.
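A sketch of what I might try next for the OOM, assuming the standard TF 1.x GPUOptions (the fraction value is just a placeholder):

import tensorflow as tf

# Cap how much memory each worker process may take from the 2 GB card,
# so several joblib workers can share the GPU instead of one grabbing ~1.9 GB.
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.2,  # placeholder fraction
                            allow_growth=True)                    # allocate lazily
config = tf.ConfigProto(intra_op_parallelism_threads=10, gpu_options=gpu_options)

with tf.Session(config=config) as sess:
    # run tf.edit_distance(...) as in main_prog
    pass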
Source: https://stackoverflow.com/questions/54402154/how-to-achieve-gpu-parallelism-using-tensor-flow