There are many ways to do this, but here's one:
grep '^
' < "$FILENAME" \
| sed \
-e 's:
::g' \
-e 's:
::g' \
-e 's:::g' \
-e 's:
: :g' \
| cut -c2-
You could use more sed(1) (-e 's:^ ::') instead of the cut -c2- to remove the leading space, but cut(1) doesn't get as much love as it deserves. The backslashes are only there for formatting: you can remove them to get a one-liner, or leave them in as long as each one is immediately followed by a newline.
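One difference between the two is worth seeing in a quick sketch: cut -c2- drops the first character no matter what it is, while the sed expression only removes a leading space. The sample strings and variable names below are made up for illustration.

```shell
# Hypothetical sample lines: one with a leading space, one without.
with_space=' alpha beta'
without_space='alpha beta'

# cut -c2- drops the first character unconditionally.
cut_a=$(printf '%s\n' "$with_space" | cut -c2-)
cut_b=$(printf '%s\n' "$without_space" | cut -c2-)

# sed only strips the first character when it is a space.
sed_a=$(printf '%s\n' "$with_space" | sed -e 's:^ ::')
sed_b=$(printf '%s\n' "$without_space" | sed -e 's:^ ::')

printf '%s\n' "cut: [$cut_a] [$cut_b]" "sed: [$sed_a] [$sed_b]"
```

So cut is fine when every line is guaranteed to start with that space, and sed is the safer choice if some lines might not.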
The basic strategy is to slowly pull the HTML apart piece by piece rather than trying to do it all at once with a single incomprehensible pile of regex syntax.
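To make the piece-by-piece idea concrete, here is a rough sketch that runs each stage separately so you can inspect the intermediate results. The input, the tag names, and the variable names are all assumptions invented for this example; they are not taken from the original question.

```shell
# Hypothetical input: the tag names and layout are made up for this sketch.
html='<table>
<tr><td>one</td><td>two</td></tr>
<tr><td>three</td><td>four</td></tr>
</table>'

# Stage 1: keep only the lines that hold data rows.
rows=$(printf '%s\n' "$html" | grep '^<tr>')

# Stage 2: drop the row wrappers.
cells=$(printf '%s\n' "$rows" | sed -e 's:<tr>::g' -e 's:</tr>::g')

# Stage 3: turn each opening cell tag into a space and delete the closing tags.
spaced=$(printf '%s\n' "$cells" | sed -e 's:<td>: :g' -e 's:</td>::g')

# Stage 4: remove the leading space left behind by the first cell.
result=$(printf '%s\n' "$spaced" | cut -c2-)

printf '%s\n' "$result"
```

Echoing $rows, $cells, and $spaced along the way makes it easy to see which stage misbehaves when the input doesn't look the way you expected.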
Parsing HTML with a shell pipeline isn't the best idea ever, but you can do it if the HTML is known to come in a very specific format. If there will be variation, then you'd be better off with a real HTML parser in Perl, Ruby, Python, or even C.