Object detection using environment

血红的双手。 Submitted on 2020-01-03 11:37:21

Question


I'd like to ask a general question about DNN-based object detection algorithms such as YOLO, SSD, or R-CNN.

Assume I'd like to detect mobile phones in small images, where the devices themselves are consequently tiny and nearly impossible to identify from just the pixels they occupy. For instance, in a 300x300 image the phone may cover only a 7x5 patch of pixels, and from that 7x5 patch alone nobody can say for sure what it shows.

On the other hand, if the picture shows a subway car in which a person is holding something black, we humans are almost certain that the little black 7x5 patch is a mobile phone.

Is my understanding correct that current state-of-the-art DNN algorithms cannot capture the surrounding context the way humans do, and instead detect objects only by their appearance in the image? If not, can you suggest an algorithm that does not learn from the small black pixel group alone, but can recognize a person holding a black object in her/his hand that is likely to be a phone?

Thanks.


Answer 1:


My background is not object detection, but this kind of contextual information does appear in the research literature. It is not a solved problem yet; there are examples applied to instance segmentation and image captioning.

So I assume there is also object-detection research that incorporates contextual information.

In any case, SSD uses a pyramid of feature maps at several scales, and the deeper, coarser maps encode some contextual information because each cell there covers a larger receptive field.
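Below is a minimal PyTorch-style sketch of that multi-scale idea; it is not the real SSD architecture, and the layer sizes, channel counts, and anchor count are illustrative assumptions only. The point is that the head attached to the deeper feature map makes its predictions from a wider patch of the image around each cell.

```python
import torch
import torch.nn as nn

class TinyMultiScaleDetector(nn.Module):
    """Toy illustration of SSD-style multi-scale detection heads.

    Each head predicts (4 box offsets + num_classes scores) per anchor.
    The deeper feature map has a larger receptive field, so its
    predictions implicitly use more surrounding context.
    """
    def __init__(self, num_classes=2, anchors_per_cell=4):
        super().__init__()
        out_ch = anchors_per_cell * (4 + num_classes)
        # Shallow stage: fine resolution, small receptive field.
        self.stage1 = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Deeper stage: coarse resolution, large receptive field (more context).
        self.stage2 = nn.Sequential(
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, 128, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.head1 = nn.Conv2d(64, out_ch, 3, padding=1)   # fine grid, small objects
        self.head2 = nn.Conv2d(128, out_ch, 3, padding=1)  # coarse grid, more context

    def forward(self, x):
        f1 = self.stage1(x)   # 300x300 input -> 75x75 feature map
        f2 = self.stage2(f1)  # 75x75 -> 19x19 feature map
        return self.head1(f1), self.head2(f2)

# Usage: feed a 300x300 image and inspect the two prediction grids.
model = TinyMultiScaleDetector()
p1, p2 = model(torch.randn(1, 3, 300, 300))
print(p1.shape, p2.shape)
```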




Answer 2:


This may be loosely related to tracking algorithms. Typically, you would couple an LSTM (or a similar sequence model) with a CNN to predict a person's behavior from time-series images.
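A minimal sketch of such a CNN+LSTM pairing, assuming short clips of frames as input and a made-up binary behavior label (e.g. "holding a phone" vs. "not"); all dimensions and class counts are placeholders, not a reference implementation.

```python
import torch
import torch.nn as nn

class CNNLSTMClassifier(nn.Module):
    """Toy sketch: a small CNN encodes each frame, an LSTM aggregates the
    sequence, and a linear layer outputs a behavior label
    (here the assumed classes 'holding a phone' / 'not holding a phone')."""
    def __init__(self, feat_dim=128, hidden_dim=64, num_classes=2):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim),
        )
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, frames):            # frames: (batch, time, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).view(b, t, -1)  # per-frame features
        _, (h_n, _) = self.lstm(feats)    # final hidden state summarizes the clip
        return self.fc(h_n[-1])           # one logit vector per sequence

# Usage: a batch of 2 clips, 8 frames each, 64x64 pixels.
logits = CNNLSTMClassifier()(torch.randn(2, 8, 3, 64, 64))
print(logits.shape)  # torch.Size([2, 2])
```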

I don't see why you couldn't set up your dataset with target labels of phone vs. no phone and have a CNN predict that class label. R-CNN or YOLO won't handle this out of the box, so you would need to tailor your algorithm and training set to this application.
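As an illustration only, such a custom setup could look like the sketch below. It assumes a hypothetical folder layout (data/train/phone and data/train/no_phone) of whole-scene images and uses torchvision's ImageFolder and resnet18; because the classifier sees the entire frame, it can learn contextual cues (the hand, the subway car) rather than only the 7x5 phone patch.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

# Assumed layout: data/train/phone/*.jpg and data/train/no_phone/*.jpg,
# each image being the whole scene (person, subway car, ...), not a crop.
transform = transforms.Compose([
    transforms.Resize((300, 300)),
    transforms.ToTensor(),
])
train_set = datasets.ImageFolder("data/train", transform=transform)
loader = DataLoader(train_set, batch_size=16, shuffle=True)

# Plain ResNet-18 with a 2-way output; because it sees the full 300x300
# frame, it can pick up contextual cues instead of just the phone pixels.
model = models.resnet18(num_classes=2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

model.train()
for images, labels in loader:
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
```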

Understanding human behavior is an important and active research topic in deep learning right now. Predicting behavior for a task like this is probably not widely supported in common libraries, since these are more domain-specific tasks and the research is new, but that doesn't mean it's not possible.

This is a survey paper on this topic that may relate to your question: https://arxiv.org/pdf/1806.11230.pdf. You may also want to look into the research going on with object tracking since it is a similar concept (but covers a wider scope than just detecting what someone is holding).



Source: https://stackoverflow.com/questions/53855354/object-detection-using-environment
