Reading some related questions made me think about the theoretical nature of HTML.
I\'m not talking about XHTML-like code here. I\'m talking about stuff like this cr
Valid HTML is not a context-free language.
First of all, HTML being an application of SGML is fiction for all practical purposes, so analyzing SGML to answer the question is useless. (However, the SGML fiction probably isn't context-free, either.)
It's more useful to look at the actually defined HTML parsing algorithm. It works on two levels: tokenization and tree building. What HTML calls tokenization is a higher-level operation than what is usually called tokenization when talking about parsers. In the case of HTML, tokenization splits a stream of characters into units like start tags, end tags, comments and text. The tokenizer expands character references. Usually, when talking about parsers, you'd probably treat stuff like the less-than sign as "tokens" and would consider character references to consist of tokens instead of being resolved by the tokenizer.
If you consider the process of splitting the input stream into tokens, that level of the HTML language is regular (except for feedback from the tree builder).
However, there are three complications: The first one is that splitting the input stream into tokens is just the first and then there's the tree builder's side that actually cares about the identifiers in the tokens. The second one is that the tree builder feeds back into the tokenizer so that some state transitions made by the tokenizer depend on the state of the tree builder! The third one is that valid documents in the language are defined by rules that apply to the output of the tree builder stage and those rules are complex enough that they can't be fully defined using tree automata (as evidenced by RELAX NG not being expressive enough to describe all the validity constraints).
This isn't an actual proof, but you can probably develop real proofs by working from complications #2 and #3.
Note that the case of invalid documents is not particularly interesting as a question of whether the language is context-free in the sense of there being a context-free grammar that generates all the possible strings with no regard to the parse tree having some intelligible interpretation in terms of the tree that an HTML parser generates. The HTML parser will successfully consume all possible strings, so in that sense, all possible strings are in the "invalid HTML" language.
Edit: Interesting questions left as exercise to the reader:
Is HTML without parse errors but ignoring validity a context-free language?
Is HTML without parse errors and ignoring general validity but with only valid element names allowed a context-free language?
(Complication #2 applies in both cases.)