You have two inputs - the initial image and the user input - and you are looking for a boolean outcome.
Ideally you would convert all your input data to a comparable format. Alternatively, you could parameterize both types of input and use a supervised machine learning algorithm (Nearest Neighbor comes to mind for closed shapes).
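As a rough illustration of the nearest-neighbor idea, here is a minimal sketch in Python. It assumes each input has already been reduced to a small numeric feature vector; the two features and the training examples are made up for illustration and are not taken from any real data set.

```python
import math

def nearest_label(sample, labeled_examples):
    """Return the label of the example closest to `sample` (1-nearest-neighbor)."""
    best_label, best_dist = None, math.inf
    for features, label in labeled_examples:
        dist = math.dist(sample, features)  # Euclidean distance (Python 3.8+)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

# Hypothetical training data: (feature vector, does it match the reference shape?)
examples = [
    ((1.00, 0.95), True),
    ((0.90, 1.00), True),
    ((0.30, 0.20), False),
    ((0.50, 0.10), False),
]

# The boolean outcome for a new, unseen input
print(nearest_label((0.95, 0.90), examples))
```

Because the labels are booleans, the classifier's answer is directly the yes/no outcome you are after. With more training examples you would likely want a k-nearest-neighbor vote rather than a single neighbor.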
The trick is in finding the right parameters. If your input is a flat image file, this could be a binary conversion, with each pixel thresholded to on or off. If the user input is a swiping motion or pen stroke, I'm sure there are ways to capture it and map it to binary as well, but the algorithm would probably be more robust if it used data closest to the original input, such as the sampled stroke coordinates.
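To make the two kinds of parameterization concrete, here is a sketch in Python under some assumptions: the image is available as rows of grayscale values from 0 to 255, and the stroke as a list of (x, y) points. The function names, the threshold of 128, and the default of 16 sample points are illustrative choices, not a prescribed method.

```python
import math

def binarize(gray_rows, threshold=128):
    """Map a grayscale image (rows of 0-255 ints) to 0/1, with 1 for dark pixels."""
    return [[1 if px < threshold else 0 for px in row] for row in gray_rows]

def resample(points, n=16):
    """Resample a stroke to n >= 2 points evenly spaced along its path length."""
    # Cumulative path length at each original point
    cum = [0.0]
    for a, b in zip(points, points[1:]):
        cum.append(cum[-1] + math.dist(a, b))
    total = cum[-1]
    if total == 0:
        return [points[0]] * n  # degenerate stroke: a single tap
    out, seg = [], 0
    for i in range(n):
        target = total * i / (n - 1)
        # Advance to the segment that contains the target length
        while seg < len(points) - 2 and cum[seg + 1] < target:
            seg += 1
        seg_len = cum[seg + 1] - cum[seg]
        t = 0.0 if seg_len == 0 else (target - cum[seg]) / seg_len
        (x0, y0), (x1, y1) = points[seg], points[seg + 1]
        out.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return out

print(binarize([[0, 200], [130, 90]]))
print(resample([(0, 0), (4, 0), (4, 4)], n=3))
```

Resampling keeps the stroke as coordinates, which stays closer to the original input than rasterizing it, and gives every stroke the same number of points so the results can be flattened into fixed-length feature vectors and compared.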