The initial idea was to start a camera stream via AVCaptureSession, find faces in that raw CMSampleBuffer and then add some images as layers on
You can load your overlay into a CIImage, use transformed(by:) with a CGAffineTransform to move it to the face position, and then use composited(over:) to blend it over the CIImage created from the video buffer.
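As a rough sketch of that idea, the function below scales and moves an overlay into a face rectangle and composites it over a frame. The names overlayOnFace and faceRect are made up for illustration, and the sketch assumes faceRect is already expressed in the frame's Core Image coordinates (pixels, origin at the bottom-left).

```swift
import CoreImage

/// Scales `overlay` to fill `faceRect`, moves it there, and blends it over `frame`.
/// Hypothetical helper: `faceRect` is assumed to be in the frame's Core Image
/// coordinate space (pixels, origin bottom-left).
func overlayOnFace(_ overlay: CIImage, over frame: CIImage, faceRect: CGRect) -> CIImage {
    let extent = overlay.extent
    // An empty or infinite overlay cannot be scaled sensibly; return the frame unchanged.
    guard !extent.isEmpty, !extent.isInfinite else { return frame }

    // 1. Move the overlay so its lower-left corner sits at (0, 0).
    // 2. Scale it to the size of the face rectangle.
    // 3. Move it to the face position.
    let transform = CGAffineTransform(translationX: -extent.origin.x, y: -extent.origin.y)
        .concatenating(CGAffineTransform(scaleX: faceRect.width / extent.width,
                                         y: faceRect.height / extent.height))
        .concatenating(CGAffineTransform(translationX: faceRect.origin.x,
                                         y: faceRect.origin.y))

    let positioned = overlay.transformed(by: transform)
    // Crop so an overlay that sticks out of the frame does not enlarge the result.
    return positioned.composited(over: frame).cropped(to: frame.extent)
}
```

In a capture callback, the frame would typically come from CIImage(cvPixelBuffer:) applied to the pixel buffer that CMSampleBufferGetImageBuffer returns for the sample buffer.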
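You will probably need to do some work to convert between the different coordinate spaces: Core Image places the origin at the bottom-left of the image, UIKit places it at the top-left, and some face detectors report normalized (0 to 1) coordinates rather than pixels.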
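Which conversion you need depends on where the face rectangle comes from, so the two helpers below are only a sketch of the usual cases: flipping a rectangle from a top-left origin to Core Image's bottom-left origin, and scaling a normalized rectangle up to pixels. Both function names are invented for this example.

```swift
import Foundation

/// Converts a rect from a top-left-origin space to a bottom-left-origin space
/// (or back again) for an image of the given height. Hypothetical helper.
func flippedVertically(_ rect: CGRect, inImageOfHeight height: CGFloat) -> CGRect {
    return CGRect(x: rect.origin.x,
                  y: height - rect.origin.y - rect.size.height,
                  width: rect.size.width,
                  height: rect.size.height)
}

/// Scales a normalized rect (all values in 0...1) to pixel coordinates.
/// Hypothetical helper; it does not change the origin convention.
func denormalized(_ rect: CGRect, imageSize: CGSize) -> CGRect {
    return CGRect(x: rect.origin.x * imageSize.width,
                  y: rect.origin.y * imageSize.height,
                  width: rect.size.width * imageSize.width,
                  height: rect.size.height * imageSize.height)
}
```

For example, a 30x40 rectangle at (10, 20) in a top-left-origin image that is 100 pixels tall ends up at (10, 40) after flipping. Camera orientation and mirroring can add a further rotation or flip that this sketch does not handle.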
There are also many more complex compositing filters available. Check out the filters in the CICategoryCompositeOperation category.
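To see what is in that category, you can ask Core Image for the filter names at runtime and then apply one by name. This is a minimal sketch: the function names are made up, and it assumes the chosen filter accepts both an input image and a background image, as CIMultiplyBlendMode and CISourceAtopCompositing do.

```swift
import CoreImage

/// Returns the names of the built-in filters in the composite-operation category.
func compositeFilterNames() -> [String] {
    return CIFilter.filterNames(inCategory: kCICategoryCompositeOperation)
}

/// Blends `overlay` over `background` using the named compositing filter.
/// Hypothetical helper; returns nil if the filter name is unknown.
/// Assumes the filter supports inputImage and inputBackgroundImage.
func blend(_ overlay: CIImage,
           over background: CIImage,
           filterName: String = "CIMultiplyBlendMode") -> CIImage? {
    guard let filter = CIFilter(name: filterName) else { return nil }
    filter.setValue(overlay, forKey: kCIInputImageKey)
    filter.setValue(background, forKey: kCIInputBackgroundImageKey)
    return filter.outputImage
}
```

Printing compositeFilterNames() once is a quick way to browse the options before picking one for blend.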