Question
I would like to train an object detection model with TensorFlow on a new object category.
As the source for the ground truth I have multiple video files that contain the object (only part of each frame contains it).
How should I ground-truth the video? Should I extract it frame by frame and label every frame, even though consecutive frames will be quite similar? What would be best practice for such a task?
Open source tools are preferred.
Answer 1:
It usually works as you described, at least for iteration zero:
- collect required examples (video)
- extract valuable frames from the video (a manual or partially automated process; see the sketch after this list)
- use OpenCV (or any other tool) to extract the required details (bounding boxes, accurate masks)
- assemble a training set
- train a model
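A minimal sketch of step 2: frames can be sampled with OpenCV, and keeping only every Nth frame avoids labelling long runs of nearly identical images. The file paths and the stride below are placeholders, not values from the question.

```python
# Sample frames from a video with OpenCV, keeping every Nth frame.
import os
import cv2

def extract_frames(video_path, out_dir, every_n=30):
    """Save every `every_n`-th frame of the video as a JPEG; return the count saved."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    index, saved = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:  # end of video or read error
            break
        if index % every_n == 0:
            cv2.imwrite(os.path.join(out_dir, f"frame_{index:06d}.jpg"), frame)
            saved += 1
        index += 1
    cap.release()
    return saved

# Example usage (hypothetical paths):
# extract_frames("my_object.mp4", "frames/", every_n=30)
```

Tuning `every_n` (or filtering frames by a simple difference metric) is how you trade annotation effort against training-set diversity.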
Here is an example of a training set produced by the approach described above (see it in action).
For iteration one, you might use the iteration-zero model to significantly improve steps 2 and 3 and grow the training set even more (a rough sketch follows).
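One way to realize that iteration-one idea is to let the iteration-zero model pre-label freshly extracted frames, so a human only corrects the proposed boxes instead of drawing them from scratch. The sketch below assumes a detector exported as a SavedModel with the TensorFlow 2 Object Detection API (whose outputs include `detection_boxes` and `detection_scores`); the model path, frame directory, and score threshold are hypothetical.

```python
# Pre-label new frames with an existing detector and keep confident boxes for review.
import glob
import cv2
import numpy as np
import tensorflow as tf

model = tf.saved_model.load("exported_model/saved_model")  # hypothetical path

def propose_boxes(image_path, score_threshold=0.5):
    """Return candidate boxes (ymin, xmin, ymax, xmax, normalized) above the threshold."""
    image = cv2.cvtColor(cv2.imread(image_path), cv2.COLOR_BGR2RGB)
    inputs = tf.convert_to_tensor(image[np.newaxis, ...], dtype=tf.uint8)
    outputs = model(inputs)
    boxes = outputs["detection_boxes"][0].numpy()
    scores = outputs["detection_scores"][0].numpy()
    return boxes[scores >= score_threshold]

for path in glob.glob("frames/*.jpg"):
    print(path, propose_boxes(path))
```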
I'm trying to solve pretty much the same problem, because it is hard to produce a training set that yields accurate segmentation (again, here it is in action, along with other examples).
Basically, start with a semi-manual approach and try to evolve.
Source: https://stackoverflow.com/questions/58910721/best-practise-for-video-ground-truthing