I'm working on a deep neural network that combines several sources of data (different modalities such as audio and video). The network is composed of three "blocks"