How would you strip HTML tags in PostgreSQL such that the data inside the tags is preserved?
I found some solutions by googling, but they were stripping the text between the tags as well.
The choice is not limited to doing it server-side with a weak parser based on inadequate regexps or doing it client-side with a robust parser. It could be implemented server-side with a robust parser, too.
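For comparison, the regexp-based server-side approach usually looks something like the sketch below. The pattern and the sample strings are illustrative assumptions, not taken from the question; `regexp_replace` with the `'g'` flag is a built-in PostgreSQL function.

```sql
-- Naive approach: delete anything that looks like a tag.
SELECT regexp_replace('<p>Hi <b>there!</b></p>', '<[^>]*>', '', 'g');
-- result: Hi there!

-- It breaks as soon as a '>' appears inside an attribute value,
-- and it leaves character entities such as &amp; undecoded.
SELECT regexp_replace('<a title="a > b">link</a> &amp; more', '<[^>]*>', '', 'g');
-- result:  b">link &amp; more
```

The second query shows why a pattern like this is only adequate for markup you fully control.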
Here's an example in PL/PerlU that takes advantage of CPAN's HTML modules (HTML::TreeBuilder and HTML::FormatText).
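CREATE FUNCTION extract_contents_from_html(text) RETURNS text AS $$
  use strict;
  use HTML::TreeBuilder;
  use HTML::FormatText;
  # Parse the HTML passed as the first argument into a tree.
  my $tree = HTML::TreeBuilder->new;
  $tree->parse_content(shift);
  # Render the tree as plain text, wrapped at column 78.
  my $formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 78);
  my $text = $formatter->format($tree);
  $tree->delete;  # release the parse tree
  return $text;
$$ LANGUAGE plperlu;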
Demo:
select extract_contents_from_html('Hi there!
How are you?') ;
Output:
extract_contents_from_html
----------------------------
Hi there!
How are you?
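To try this on real data, the PL/PerlU language has to be installed in the database, and both Perl modules must be available to the Perl interpreter that the server uses. A minimal sketch follows. The table `pages` and its columns `id` and `body_html` are hypothetical names used only for illustration.

```sql
-- One-time setup per database; requires superuser privileges.
CREATE EXTENSION IF NOT EXISTS plperlu;

-- pages(id, body_html) is a hypothetical table, not part of the original answer.
SELECT id,
       extract_contents_from_html(body_html) AS body_text
FROM pages
WHERE body_html IS NOT NULL;
```

The `WHERE` clause skips NULL rows, since the function as written does not handle a NULL argument specially.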
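One needs to be aware of the caveats that come with untrusted languages, though: only a superuser can create functions in PL/PerlU, and the Perl code runs with the operating-system privileges of the database server process, without the restrictions that trusted PL/Perl imposes.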
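One possible precaution, sketched here as an assumption rather than a requirement, is to limit who may call the function. By default PostgreSQL grants EXECUTE on new functions to PUBLIC. The role name `app_reader` is hypothetical.

```sql
-- Remove the default permission that lets every role call the function.
REVOKE EXECUTE ON FUNCTION extract_contents_from_html(text) FROM PUBLIC;

-- Grant it back only to the roles that need it (app_reader is a made-up name).
GRANT EXECUTE ON FUNCTION extract_contents_from_html(text) TO app_reader;
```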