You may be able to develop an RL model that chooses which classifier to use. The gt labels being used to train the classifiers and the change in performance of those classifiers being the reward for the RL model. As others have said, it would probably take a very long time to converge, if it ever does. This idea may also require many tricks and tweaks to make it work. I would recommend searching for research papers on this topic.