I am a PHP newbie and am working on a basic form validation script. I understand that input filtering and output escaping are both vital for security reasons. My question is whe
I understand that input filtering and output escaping are both vital for security reasons.
I'd say rather that output escaping is vital for security and correctness reasons, and input filtering is a potentially useful measure for defence-in-depth and for enforcing specific application rules.
The input filtering step and the output escaping step are necessarily separate concerns, and cannot be combined into one step, not least because there are many different types of output escaping, and the right one has to be chosen for each output context (eg HTML-escaping in a page, URL-escaping to make a link, SQL-escaping, and so on).
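To make the "different context, different escaping" point concrete, here is a minimal sketch. The value of $name, the /search URL and the users table are made up for illustration, and the SQL part is shown only as a comment because it needs a database connection ($pdo is assumed, not defined here).

```php
<?php
// Hypothetical user-supplied value, used only for illustration.
$name = 'Tom & "Jerry" <b>';

// HTML context: escape the characters that are special in markup.
$html = '<p>Hello, ' . htmlspecialchars($name, ENT_QUOTES, 'UTF-8') . '</p>';

// URL context: percent-encode the value as a query parameter, then
// HTML-escape the finished URL because it is placed inside an attribute.
$url  = '/search?q=' . rawurlencode($name);
$link = '<a href="' . htmlspecialchars($url, ENT_QUOTES, 'UTF-8') . '">search</a>';

// SQL context: a parameterised query is usually preferable to escaping by hand.
// Assumes an existing PDO connection in $pdo and a hypothetical users table:
//   $stmt = $pdo->prepare('SELECT * FROM users WHERE name = ?');
//   $stmt->execute([$name]);

echo $html, "\n", $link, "\n";
```

The same raw string ends up in three different encoded forms, which is why the escaping has to happen at output time rather than once at input time.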
Unfortunately PHP is traditionally very hazy on these issues and so offers a bunch of mixed-message functions that are likely to mislead you.
In the example field below, the field is plain text, so all I need to do is sanitize it.
Yes. Alas, FILTER_SANITIZE_STRING is not in any way a sane sanitiser. It completely removes some content (strip_tags, which is itself highly non-sensible) whilst HTML-escaping other content, eg quotes turn into &#34;. This is a nonsense.
Instead, for input sanitisation, look at:
checking it's a valid string for the encoding you're using (hopefully UTF-8; see eg this regex for that);
removing control characters, U+0000–U+001F and U+007F–U+009F. Allow the newline through only on deliberate multi-line text fields;
removing the characters that are not suitable for use in markup;
validating the input conforms to application requirements on a field-by-field basis, for data whose content model is more specific than arbitrary text strings. Although your escaping should handle a < character correctly, it's probably a good idea to get rid of it early in fields where it makes no sense to have one.
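A rough sketch of the first two steps and a field-specific check might look like the following. The function names sanitise_text and is_valid_username, and the username rule itself, are hypothetical and only for illustration. The "not suitable for markup" character list is left out here. UTF-8 validity is checked with an empty pattern and the /u modifier, which makes preg_match fail on malformed input.

```php
<?php
// Hypothetical helper: returns the cleaned string, or null when the
// input is not valid UTF-8.
function sanitise_text(string $input, bool $multiline = false): ?string
{
    // 1. Reject malformed UTF-8: with the /u modifier, preg_match()
    //    returns false instead of 1 when the subject is invalid.
    if (preg_match('//u', $input) !== 1) {
        return null;
    }

    // 2. Strip control characters U+0000 to U+001F and U+007F to U+009F.
    //    For multi-line fields the newline (U+000A) is allowed through.
    $pattern = $multiline
        ? '/[\x{00}-\x{09}\x{0B}-\x{1F}\x{7F}-\x{9F}]/u'
        : '/[\x{00}-\x{1F}\x{7F}-\x{9F}]/u';

    return preg_replace($pattern, '', $input);
}

// Hypothetical application rule for one specific field: usernames are
// 3 to 20 ASCII letters, digits or underscores.
function is_valid_username(string $value): bool
{
    return preg_match('/^[A-Za-z0-9_]{3,20}\z/', $value) === 1;
}
```

Note that in multi-line mode this removes carriage returns, so "\r\n" line endings collapse to "\n"; whether that is what you want depends on the application.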
For the output escaping step I'd generally prefer htmlspecialchars() to htmlentities(), though your correct use of the UTF-8 argument stops the latter function breaking in the way it usually does.
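To illustrate the difference, here is a small comparison on a made-up sample string. Both calls are safe for HTML output when given the right charset; htmlspecialchars() only touches the characters that are special in markup, whereas htmlentities() also rewrites every character that has a named entity.

```php
<?php
// Sample text for illustration; \u{E9} is "é" encoded as UTF-8.
$text = "Caf\u{E9} <b> & \"quotes\"";

// Escapes only &, <, > and quotes; the é is left as-is.
$special = htmlspecialchars($text, ENT_QUOTES, 'UTF-8');

// Additionally turns é into the named entity &eacute;.
$entities = htmlentities($text, ENT_QUOTES, 'UTF-8');

echo $special, "\n";
echo $entities, "\n";
```

With a UTF-8 page there is no need for the extra entities, and leaving the charset argument off htmlentities() on PHP versions before 5.4 (where the default was ISO-8859-1) would mangle multi-byte characters, which is the usual breakage.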