I want to train an ssd-inception-v2 model from the TensorFlow Object Detection API. The training dataset I want to use is a bunch of cropped images of different sizes without bounding boxes, since the box is effectively the whole image.
Object detection networks typically predict both the location of a bounding box and the class of the object inside it, so the training data needs to contain real bounding box annotations. If you feed your model training data in which the bounding box is always the full size of the image, you will likely get garbage predictions out, including a box that always outlines the entire image.
This sounds like a problem with your training data. You shouldn't supply cropped images, but full images/scenes with your object annotated inside them. With crops whose boxes span the whole frame, you're essentially training a classifier, not a detector.
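For illustration, here is a minimal sketch of how one properly annotated training example might be written to a TFRecord for the Object Detection API. The feature keys follow the schema used by the API's `create_*_tf_record` example scripts; the file name, image dimensions, class name, and box coordinates below are all placeholders for your own data.

```python
import tensorflow as tf

# Small helpers mirroring object_detection.utils.dataset_util,
# inlined here so the sketch is self-contained.
def bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def bytes_list_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=value))

def int64_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

def int64_list_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=value))

def float_list_feature(value):
    return tf.train.Feature(float_list=tf.train.FloatList(value=value))

# Hypothetical full, uncropped scene containing the object.
with open('full_scene.jpg', 'rb') as f:
    encoded_jpg = f.read()

height, width = 480, 640  # pixel dimensions of the scene

# The object occupies only part of the scene, so the normalized box is
# NOT [0, 0, 1, 1]. A box that always spans the whole image teaches the
# model nothing about localization.
xmins, xmaxs = [120 / width], [340 / width]
ymins, ymaxs = [80 / height], [300 / height]

example = tf.train.Example(features=tf.train.Features(feature={
    'image/height': int64_feature(height),
    'image/width': int64_feature(width),
    'image/filename': bytes_feature(b'full_scene.jpg'),
    'image/source_id': bytes_feature(b'full_scene.jpg'),
    'image/encoded': bytes_feature(encoded_jpg),
    'image/format': bytes_feature(b'jpeg'),
    'image/object/bbox/xmin': float_list_feature(xmins),
    'image/object/bbox/xmax': float_list_feature(xmaxs),
    'image/object/bbox/ymin': float_list_feature(ymins),
    'image/object/bbox/ymax': float_list_feature(ymaxs),
    'image/object/class/text': bytes_list_feature([b'my_object']),
    'image/object/class/label': int64_list_feature([1]),
}))

with tf.io.TFRecordWriter('train.record') as writer:
    writer.write(example.SerializeToString())
```

The key point is the `image/object/bbox/*` fields: each annotated object gets normalized coordinates describing where it sits within the full scene, which is exactly the signal your cropped images can't provide.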
Try training on uncropped images with proper per-object bounding box annotations and see how you get on.