Clustering human faces from a video

Submitted by 丶灬走出姿态 on 2021-02-07 03:43:02

Question


I have run the face detection algorithm built into OpenCV to extract faces from each frame of a video (sampled at 1 fps). I have also resized each face image to the same size and cropped away part of the image to remove background noise and hair. Now the problem is that I have to cluster these face images, with each cluster corresponding to one person. I implemented the algorithm described here: http://bitsearch.blogspot.in/2013/02/unsupervised-face-clustering-with-opencv.html

Basically, the above algorithm uses the LBPH face recognizer of OpenCV iteratively to cluster the images. Even in the description on that page the results are not satisfactory, and in my implementation they are worse. Can anyone suggest a better way to cluster faces, perhaps using different features and a different clustering algorithm? The number of clusters is unknown.


Answer 1:


I suggest having a look at

FaceNet: A Unified Embedding for Face Recognition and Clustering

My ShortScience summary (go there if you want to see the math rendered correctly):

FaceNet directly maps face images to $\mathbb{R}^{128}$ where distances directly correspond to a measure of face similarity. They use a triplet loss function. The triplet is (face of person A, other face of person A, face of person which is not A). Later, this is called (anchor, positive, negative).
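Because distances in this embedding directly correspond to similarity, the faces can be grouped without fixing the number of clusters in advance: any distance-threshold-based method works. A minimal sketch of that idea in plain NumPy (the 2-d toy vectors and the threshold value are illustrative stand-ins for real 128-d FaceNet embeddings, which would need a tuned threshold):

```python
import numpy as np

def cluster_by_threshold(embeddings, threshold):
    """Greedy clustering with no preset cluster count: each embedding
    joins the first cluster whose representative lies within `threshold`
    (Euclidean distance), or else starts a new cluster."""
    reps, labels = [], []
    for e in embeddings:
        for i, r in enumerate(reps):
            if np.linalg.norm(e - r) < threshold:
                labels.append(i)
                break
        else:
            reps.append(e)
            labels.append(len(reps) - 1)
    return labels

# Toy 2-d "embeddings" standing in for 128-d FaceNet outputs:
faces = np.array([[1.0, 0.0], [0.98, 0.02], [0.0, 1.0], [0.02, 0.98]])
print(cluster_by_threshold(faces, threshold=0.5))  # [0, 0, 1, 1]
```

In practice a density-based method such as DBSCAN on the embeddings serves the same purpose, since it also takes a distance threshold rather than a cluster count.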

The loss function is learned and is inspired by LMNN. The idea is to minimize the distance between the two images of the same person and maximize the distance to the other person's image.

LMNN

Large Margin Nearest Neighbor (LMNN) learns a pseudo-metric

$$d(x, y) = (x -y) M (x -y)^T$$

where $M$ is a positive semi-definite matrix. The only difference between a pseudo-metric and a metric is that $d(x, y) = 0$ need not imply $x = y$: when $M$ is rank-deficient, two distinct points can be at distance zero.
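A tiny NumPy example of such a pseudo-metric (the rank-deficient $M$ here is chosen purely for illustration): because $M$ ignores the second coordinate, two distinct points can have distance zero.

```python
import numpy as np

def pseudo_metric(x, y, M):
    # d(x, y) = (x - y) M (x - y)^T, as in the LMNN formulation above.
    d = x - y
    return float(d @ M @ d)

# Rank-deficient positive semi-definite M: only the first coordinate counts.
M = np.array([[1.0, 0.0],
              [0.0, 0.0]])
x = np.array([1.0, 2.0])
y = np.array([1.0, 5.0])  # differs from x only in the ignored coordinate
print(pseudo_metric(x, y, M))  # 0.0, even though x != y
```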

Curriculum Learning: Triplet selection

Show simple examples first, then increase the difficulty. This is done by selecting the triplets.

They use triplets which are hard. For the positive example, this means the distance between the anchor and the positive example is high; for the negative example, it means the distance between the anchor and the negative example is low.
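This selection step can be sketched as follows, assuming the embeddings for one anchor's same-person faces (`positives`) and other-person faces (`negatives`) are already available; the function names are my own, not the paper's:

```python
import numpy as np

def hardest_triplet(anchor, positives, negatives):
    """Pick the hardest positive (farthest same-person face) and the
    hardest negative (closest other-person face) for one anchor."""
    dp = np.linalg.norm(positives - anchor, axis=1)  # anchor-positive distances
    dn = np.linalg.norm(negatives - anchor, axis=1)  # anchor-negative distances
    return positives[np.argmax(dp)], negatives[np.argmin(dn)]

anchor = np.array([0.0, 0.0])
positives = np.array([[1.0, 0.0], [3.0, 0.0]])
negatives = np.array([[5.0, 0.0], [2.0, 0.0]])
hp, hn = hardest_triplet(anchor, positives, negatives)
print(hp, hn)  # [3. 0.] [2. 0.]
```

In the paper this mining is done per mini-batch rather than over the whole dataset, which keeps it tractable.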

They want to have

$$||f(x_i^a) - f(x_i^p)||_2^2 + \alpha < ||f(x_i^a) - f(x_i^n)||_2^2$$

where $\alpha$ is a margin, $x_i^a$ is the anchor, $x_i^p$ is the positive face example and $x_i^n$ is the negative example. They increase $\alpha$ over time. It is crucial that $f$ maps the images not onto all of $\mathbb{R}^{128}$, but onto the unit sphere: otherwise the constraint could be satisfied for any $\alpha$ simply by rescaling, e.g. setting $f' = 2 \cdot f$, which multiplies every squared distance by 4.

Tasks

  • Face verification: Is this the same person?
  • Face recognition: Who is this person?

Datasets

  • 99.63% accuracy on Labeled Faces in the Wild (LFW)
  • 95.12% accuracy on YouTube Faces DB

Network

Two models are evaluated: The Zeiler & Fergus model and an architecture based on the Inception model.

See also

  • DeepFace: Closing the Gap to Human-Level Performance in Face Verification


Source: https://stackoverflow.com/questions/26179052/clustering-human-faces-from-a-video
