OK, so I have been reading about markdown here on SO and elsewhere, and the steps between user input and the db are usually given as:
- convert markdown to html
- sanitize html (w/whitelist)
- insert into database
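To make the "sanitize html (w/whitelist)" step concrete, here is a minimal sketch using only Python's standard library. The whitelist contents, the names `sanitize_html` and `prepare_for_db`, and the `markdown_to_html` callable are all assumptions for illustration; a real site would plug in its own converter and a well-tested sanitizer rather than this toy.

```python
from html import escape
from html.parser import HTMLParser

# Assumed whitelist for illustration only; pick your own in practice.
ALLOWED_TAGS = {"p", "em", "strong", "code", "pre", "ul", "ol", "li", "blockquote"}


class WhitelistSanitizer(HTMLParser):
    """Re-emit only whitelisted tags, dropping every attribute."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in ALLOWED_TAGS:
            self.out.append(f"<{tag}>")  # attributes are never copied

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(escape(data))  # text is always re-escaped


def sanitize_html(unsafe_html: str) -> str:
    """unsafeHTML -> safeHTML: keep whitelisted tags and escaped text only."""
    parser = WhitelistSanitizer()
    parser.feed(unsafe_html)
    parser.close()
    return "".join(parser.out)


def prepare_for_db(markdown_text: str, markdown_to_html) -> str:
    """Convert first, then sanitize; the result is what gets inserted.

    `markdown_to_html` is a hypothetical callable standing in for whatever
    markdown converter you actually use.
    """
    return sanitize_html(markdown_to_html(markdown_text))
```

The point of this order is that the sanitizer only ever has to understand HTML, whatever the converter happened to produce.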
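Here, the assumption is that the HTML sanitizer (a single unsafeHTML->safeHTML function) is correct. The alternative order would be: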
- sanitize markdown (remove all tags - no exceptions)
- convert to html
- insert into database
Here, the assumptions are more demanding:
The markdown sanitizer has to know not just about dangerous HTML and dangerous markdown, but also how the markdown->HTML converter does its job. That makes it more complex, and more likely to be wrong, than the simpler unsafeHTML->safeHTML function above.
As a concrete example, "remove all tags" assumes you can identify tags, and it would not work against UTF-7 attacks, where the angle brackets are encoded and do not look like tags when the text is scanned. There might be other encoding attacks out there that render this assumption moot, or there might be a bug that causes the markdown->HTML program to convert a sequence such as a full-width '<' (U+FF1C), some exotic whitespace characters that markdown strips, and the text SCRIPT into a tag.
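A small sketch of why tag stripping before conversion is fragile. The `strip_tags` function is a deliberately naive stand-in for "remove all tags", and the NFKC normalization step is a hypothetical later stage; I am not claiming any particular markdown converter does this, only that any stage that rewrites characters after the stripper has run can reintroduce a tag.

```python
import re
import unicodedata


def strip_tags(text: str) -> str:
    """Naive 'remove all tags': delete anything between ASCII '<' and '>'."""
    return re.sub(r"<[^>]*>", "", text)


# Full-width '<' and '>' (U+FF1C, U+FF1E) are not ASCII angle brackets,
# so the stripper finds nothing to remove.
fullwidth = "\uff1cscript\uff1ealert(1)\uff1c/script\uff1e"
after_strip = strip_tags(fullwidth)

# Hypothetical later stage: NFKC normalization folds full-width forms
# to their ASCII equivalents, which turns the text into a real tag.
after_later_stage = unicodedata.normalize("NFKC", after_strip)

# UTF-7: '<' is written "+ADw-" and '>' is written "+AD4-", so the payload
# is plain ASCII with no angle brackets when the stripper looks at it.
utf7_payload = "+ADw-script+AD4-alert(1)+ADw-/script+AD4-"
after_strip_utf7 = strip_tags(utf7_payload)

# Anything that later interprets those bytes as UTF-7 sees a script tag.
decoded_as_utf7 = after_strip_utf7.encode("ascii").decode("utf-7")

print(after_later_stage)
print(decoded_as_utf7)
```

In both cases the stripper did exactly what it was asked to do, and the tag still appears afterwards, which is why sanitizing the final HTML is the safer place to do the check.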
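The most secure would be to store the markdown as entered and run the conversion and the HTML sanitizer each time it is displayed.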
That way, when you update your HTML sanitizer you get protection against any newly discovered attacks. This is often inefficient, but you can still get pretty good security by storing a timestamp with each piece of HTML you insert, so that you can tell which entries might have been inserted while someone knew about an attack that gets past your sanitizer.
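The timestamp idea could look roughly like this with `sqlite3`. The table layout, the column names, and the helper names (`make_db`, `insert_html`, `suspect_rows`) are made up for the sketch, and it assumes timestamps are stored as ISO 8601 UTC strings in one fixed format so that string comparison matches time order.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical posts table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE posts ("
        "id INTEGER PRIMARY KEY, "
        "html TEXT NOT NULL, "
        "inserted_at TEXT NOT NULL)"  # ISO 8601 UTC, fixed format
    )
    return conn


def insert_html(conn: sqlite3.Connection, html: str, inserted_at: str) -> None:
    """Store already-sanitized HTML together with its insertion time."""
    conn.execute(
        "INSERT INTO posts (html, inserted_at) VALUES (?, ?)",
        (html, inserted_at),
    )


def suspect_rows(conn, attack_known_from: str, sanitizer_fixed_at: str):
    """Return ids of rows inserted while the sanitizer was known to be bypassable."""
    cur = conn.execute(
        "SELECT id FROM posts "
        "WHERE inserted_at >= ? AND inserted_at < ? ORDER BY id",
        (attack_known_from, sanitizer_fixed_at),
    )
    return [row[0] for row in cur]


conn = make_db()
insert_html(conn, "<p>old</p>", "2024-01-01T00:00:00Z")
insert_html(conn, "<p>during</p>", "2024-02-15T12:00:00Z")
insert_html(conn, "<p>new</p>", "2024-03-10T00:00:00Z")
suspects = suspect_rows(conn, "2024-02-01T00:00:00Z", "2024-03-01T00:00:00Z")
```

Only the rows in `suspects` would need to be re-checked or re-sanitized after the sanitizer is fixed, instead of everything in the table.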