How to force tensorflow to use all available GPUs?

Submitted by 旧城冷巷雨未停 on 2019-12-07 18:56:55

Question


I have an 8 GPU cluster and when I run a piece of Tensorflow code (pasted below), it only utilizes a single GPU instead of all 8. I confirmed this using nvidia-smi.

# Imports assumed from the original script (not shown in the question)
import os
import sys
import random
import warnings

import numpy as np
import cv2
import tensorflow as tf
import matplotlib.pyplot as plt
from tqdm import tqdm
from skimage.io import imread, imshow
from skimage.transform import resize

from keras.models import Model
from keras.layers import Input, Lambda, Conv2D, Conv2DTranspose, MaxPooling2D, concatenate
from keras.callbacks import EarlyStopping, ModelCheckpoint
from keras import optimizers
from keras import backend as K

# Set some parameters
IMG_WIDTH = 256
IMG_HEIGHT = 256
IMG_CHANNELS = 3
TRAIN_IM = './train_im/'
TRAIN_MASK = './train_mask/'
TEST_PATH = './test/'

warnings.filterwarnings('ignore', category=UserWarning, module='skimage')
num_training = len(os.listdir(TRAIN_IM))
num_test = len(os.listdir(TEST_PATH))
# Get and resize train images
X_train = np.zeros((num_training, IMG_HEIGHT, IMG_WIDTH, IMG_CHANNELS), dtype=np.uint8)
Y_train = np.zeros((num_training, IMG_HEIGHT, IMG_WIDTH, 1), dtype=bool)
print('Getting and resizing train images and masks ... ')
sys.stdout.flush()

# Load training images and their corresponding masks
for count, filename in tqdm(enumerate(os.listdir(TRAIN_IM)), total=num_training):
    img = imread(os.path.join(TRAIN_IM, filename))[:,:,:IMG_CHANNELS]
    img = resize(img, (IMG_HEIGHT, IMG_WIDTH), mode='constant', preserve_range=True)
    X_train[count] = img
    name, ext = os.path.splitext(filename)
    mask_name = name + '_mask' + ext
    mask = cv2.imread(os.path.join(TRAIN_MASK, mask_name))[:,:,:1]
    mask = resize(mask, (IMG_HEIGHT, IMG_WIDTH))
    Y_train[count] = mask

# Check if training data looks all right
ix = random.randint(0, num_training-1)
print(ix)
imshow(X_train[ix])
plt.show()
imshow(np.squeeze(Y_train[ix]))
plt.show()
# Define IoU metric
def mean_iou(y_true, y_pred):
    prec = []
    for t in np.arange(0.5, 1.0, 0.05):
        y_pred_ = tf.to_int32(y_pred > t)
        score, up_opt = tf.metrics.mean_iou(y_true, y_pred_, 2)
        K.get_session().run(tf.local_variables_initializer())
        with tf.control_dependencies([up_opt]):
            score = tf.identity(score)
        prec.append(score)
    return K.mean(K.stack(prec), axis=0)

# Build U-Net model
inputs = Input((IMG_HEIGHT, IMG_WIDTH, IMG_CHANNELS))
s = Lambda(lambda x: x / 255) (inputs)
width = 64
c1 = Conv2D(width, (3, 3), activation='relu', padding='same') (s)
c1 = Conv2D(width, (3, 3), activation='relu', padding='same') (c1)
p1 = MaxPooling2D((2, 2)) (c1)

c2 = Conv2D(width*2, (3, 3), activation='relu', padding='same') (p1)
c2 = Conv2D(width*2, (3, 3), activation='relu', padding='same') (c2)
p2 = MaxPooling2D((2, 2)) (c2)

c3 = Conv2D(width*4, (3, 3), activation='relu', padding='same') (p2)
c3 = Conv2D(width*4, (3, 3), activation='relu', padding='same') (c3)
p3 = MaxPooling2D((2, 2)) (c3)

c4 = Conv2D(width*8, (3, 3), activation='relu', padding='same') (p3)
c4 = Conv2D(width*8, (3, 3), activation='relu', padding='same') (c4)
p4 = MaxPooling2D(pool_size=(2, 2)) (c4)

c5 = Conv2D(width*16, (3, 3), activation='relu', padding='same') (p4)
c5 = Conv2D(width*16, (3, 3), activation='relu', padding='same') (c5)

u6 = Conv2DTranspose(width*8, (2, 2), strides=(2, 2), padding='same') (c5)
u6 = concatenate([u6, c4])
c6 = Conv2D(width*8, (3, 3), activation='relu', padding='same') (u6)
c6 = Conv2D(width*8, (3, 3), activation='relu', padding='same') (c6)

u7 = Conv2DTranspose(width*4, (2, 2), strides=(2, 2), padding='same') (c6)
u7 = concatenate([u7, c3])
c7 = Conv2D(width*4, (3, 3), activation='relu', padding='same') (u7)
c7 = Conv2D(width*4, (3, 3), activation='relu', padding='same') (c7)

u8 = Conv2DTranspose(width*2, (2, 2), strides=(2, 2), padding='same') (c7)
u8 = concatenate([u8, c2])
c8 = Conv2D(width*2, (3, 3), activation='relu', padding='same') (u8)
c8 = Conv2D(width*2, (3, 3), activation='relu', padding='same') (c8)

u9 = Conv2DTranspose(width, (2, 2), strides=(2, 2), padding='same') (c8)
u9 = concatenate([u9, c1], axis=3)
c9 = Conv2D(width, (3, 3), activation='relu', padding='same') (u9)
c9 = Conv2D(width, (3, 3), activation='relu', padding='same') (c9)

outputs = Conv2D(1, (1, 1), activation='sigmoid') (c9)

model = Model(inputs=[inputs], outputs=[outputs])

sgd = optimizers.SGD(lr=0.03, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(optimizer=sgd, loss='binary_crossentropy', metrics=[mean_iou])
model.summary()

# Fit model
earlystopper = EarlyStopping(patience=20, verbose=1)
checkpointer = ModelCheckpoint('nuclei_only.h5', verbose=1, save_best_only=True)
results = model.fit(X_train, Y_train, validation_split=0.05, batch_size = 32, verbose=1, epochs=100, 
                callbacks=[earlystopper, checkpointer])

I would like to use MXNet or some other method to run this code on all available GPUs. However, I'm not sure how to do this. All the resources I've found only show how to do this on the MNIST dataset, and since I read my own dataset differently, I'm not quite sure how to amend the code.


Answer 1:


TL;DR: Use multi_gpu_model() from Keras.


TS;WM:

From the Tensorflow Guide:

If you have more than one GPU in your system, the GPU with the lowest ID will be selected by default.

If you want to use multiple GPUs, unfortunately you have to manually specify which tensors to put on each GPU, like:

with tf.device('/device:GPU:2'):

More info in the Tensorflow Guide Using Multiple GPUs.
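
For illustration, here is a minimal sketch of explicit device placement in graph-mode TensorFlow; the device strings assume at least two visible GPUs, and the shapes are arbitrary:

import tensorflow as tf

with tf.device('/device:GPU:0'):
    a = tf.random_normal([1024, 1024])
    b = tf.matmul(a, a)          # this matmul is pinned to GPU 0

with tf.device('/device:GPU:1'):
    c = tf.random_normal([1024, 1024])
    d = tf.matmul(c, c)          # this matmul is pinned to GPU 1

# allow_soft_placement falls back to another device if a requested GPU is
# missing; log_device_placement prints where each op actually ran.
config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=True)
with tf.Session(config=config) as sess:
    print([r.shape for r in sess.run([b, d])])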

In terms of how to distribute your network over multiple GPUs, there are two main ways of doing it.

  1. You distribute your network layer-wise over the GPUs. This is easier to implement, but it will not yield much of a performance benefit because the GPUs end up waiting for each other to complete their operations.

  2. You create separate copies of your network, called "towers", on each GPU. When you feed the eight-tower network, you break up your input batch into 8 parts and distribute them. Let the network forward propagate, then sum the gradients and do the backward propagation. This results in an almost-linear speedup with the number of GPUs. It is much more difficult to implement, however, because you also have to deal with complexities related to batch normalization, and it is very advisable to make sure you randomize your batches properly. There is a nice tutorial here. You should also review the Inception V3 code referenced there for ideas on how to structure such a thing, especially _tower_loss(), _average_gradients(), and the part of train() starting with for i in range(FLAGS.num_gpus):. A minimal sketch of this tower pattern follows right after this list.
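
To make option 2 concrete, here is a minimal, hedged sketch of the tower pattern in graph-mode TensorFlow 1.x. build_tower_loss() is a tiny stand-in for the U-Net in the question, and the shapes, names and NUM_GPUS value are assumptions rather than the poster's actual code:

import tensorflow as tf

NUM_GPUS = 8
BATCH_SIZE = 32  # must be divisible by NUM_GPUS

def build_tower_loss(x, y):
    # Stand-in model: one shared 1x1 conv head with a sigmoid cross-entropy loss.
    # reuse=tf.AUTO_REUSE makes every tower share the same weights.
    logits = tf.layers.conv2d(x, 1, 1, name='head', reuse=tf.AUTO_REUSE)
    return tf.reduce_mean(
        tf.nn.sigmoid_cross_entropy_with_logits(labels=y, logits=logits))

def average_gradients(tower_grads):
    # tower_grads holds one list of (grad, var) pairs per GPU;
    # average the gradients for each variable across all towers.
    averaged = []
    for grads_and_vars in zip(*tower_grads):
        grads = tf.stack([g for g, _ in grads_and_vars])
        averaged.append((tf.reduce_mean(grads, axis=0), grads_and_vars[0][1]))
    return averaged

images = tf.placeholder(tf.float32, [BATCH_SIZE, 256, 256, 3])
masks = tf.placeholder(tf.float32, [BATCH_SIZE, 256, 256, 1])
image_shards = tf.split(images, NUM_GPUS)   # one sub-batch per GPU
mask_shards = tf.split(masks, NUM_GPUS)

opt = tf.train.MomentumOptimizer(0.03, 0.9, use_nesterov=True)
tower_grads = []
for i in range(NUM_GPUS):
    with tf.device('/device:GPU:%d' % i):
        loss = build_tower_loss(image_shards[i], mask_shards[i])
        tower_grads.append(opt.compute_gradients(loss))

# One training step that applies the gradients averaged over all towers.
# (For simplicity the shared variables end up on GPU:0; production code such
# as the Inception example usually pins them to the CPU.)
train_op = opt.apply_gradients(average_gradients(tower_grads))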

In case you want to give Keras a try, it has now simplified multi-GPU training significantly with multi_gpu_model(), which can do all the heavy lifting for you.
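
Here is a hedged sketch of how the model built in the question could be wrapped, assuming Keras >= 2.0.9 and that all eight GPUs are visible to TensorFlow; model, sgd, mean_iou, X_train, Y_train, earlystopper and checkpointer refer to the names defined in the question's code:

from keras.utils import multi_gpu_model

# Replicate the single-GPU model on 8 GPUs; each batch of 32 is split into
# 8 sub-batches of 4, processed in parallel, and the results are merged.
parallel_model = multi_gpu_model(model, gpus=8)
parallel_model.compile(optimizer=sgd, loss='binary_crossentropy', metrics=[mean_iou])

results = parallel_model.fit(X_train, Y_train, validation_split=0.05,
                             batch_size=32, epochs=100, verbose=1,
                             callbacks=[earlystopper, checkpointer])

One common caveat: a ModelCheckpoint attached to the wrapper saves the multi-GPU graph, so it is usual to keep a reference to the original single-GPU model and save that model's weights instead.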



Source: https://stackoverflow.com/questions/50032721/how-to-force-tensorflow-to-use-all-available-gpus
