How would you strip HTML tags in PostgreSQL such that the data inside the tags is preserved?
I found some solutions by googling, but they were stripping the text be
Don't do it in PostgreSQL.
It is not designed to do this.
Use PHP or whatever language you are using to serve webpages.
Be careful with regular expressions though. HTML is a complex language that cannot be fully described by regular expressions.
Use a DOM parser to strip out tags.
If you use regular expressions, you can guarantee that nothing unsafe is left behind, but you can easily strip out more than you want, or leave malformed tags.
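As a rough illustration of the parser approach in application code, here is a minimal Python sketch built on the standard library's html.parser. The names TextExtractor and strip_tags are hypothetical, and skipping the contents of script and style elements is my own assumption about what "the data inside the tags" should mean. It is a sketch, not a sanitizer.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: keeps text nodes and drops every tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Text outside script/style is kept verbatim.
        if self._skip_depth == 0:
            self.parts.append(data)


def strip_tags(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return "".join(parser.parts)
```

Because the parser understands quoted attribute values, a ">" inside an attribute does not leak into the output, which is exactly the case where a simple regex tends to go wrong.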
Store the content in your database as the XML datatype, not as "second class" TEXT, because it is very simple to convert HTML into XHTML (see HTML-Tidy or the standard DOM's loadHTML() and saveXML() methods).
It is fast and very safe!
The common information retrieval need is not the full content but something inside the XHTML, so the power of XPath is welcome.
Example: retrieve all paragraphs with class="fn":
WITH needinfo AS (
  SELECT *, xpath('//p[@class="fn"]//text()', xhtml)::text[] AS frags
  FROM t
)
SELECT array_to_string(frags, ' ') AS my_p_fn2txt
FROM needinfo
WHERE array_length(frags, 1) > 0;
-- for full content use xpath('//text()', xhtml)
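For readers who want to try the same idea outside the database, here is a small Python sketch using the standard library's xml.etree.ElementTree. The function name fn_paragraph_text and the sample document are hypothetical, and the sketch assumes the XHTML has no default namespace; with a namespace, the path would need a prefix mapping.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, well-formed and without a namespace.
SAMPLE = (
    '<html><body>'
    '<p class="fn">Jane<b>Doe</b></p>'
    '<p>other</p>'
    '<p class="fn">John</p>'
    '</body></html>'
)


def fn_paragraph_text(xhtml):
    """Collect the text nodes under every <p class="fn"> and join them."""
    root = ET.fromstring(xhtml)
    frags = []
    for p in root.iterfind(".//p[@class='fn']"):
        frags.extend(p.itertext())  # text of the paragraph and its children
    return " ".join(frags)


print(fn_paragraph_text(SAMPLE))
```

As in the SQL version, the text fragments are simply joined with a space, so whitespace handling is left to the caller.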
I do not recommend it, because it is not an "information retrieval" solution and, as @James and others commented here, the regex solution is not so safe.
I like "pure SQL"; for me it is better than using Perl (see @Daniel's solution) or another language.
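CREATE OR REPLACE FUNCTION strip_tags(TEXT) RETURNS TEXT AS $$
  -- Backslashes are doubled because E'...' strings consume a single backslash.
  -- Inner call: replace the first tag that has an alt attribute with its alt text.
  -- Outer call: remove every remaining tag.
  SELECT regexp_replace(
    regexp_replace($1, E'(?x)<[^>]*?(\\s alt \\s* = \\s* ([\'"]) ([^>]*?) \\2) [^>]*? >', E'\\3'),
    E'(?x)(< [^>]*? >)', '', 'g')
$$ LANGUAGE SQL;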
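To make the behaviour of those two patterns easier to experiment with, here is a rough Python port using the re module. It is only an illustration: Python's regex engine is not PostgreSQL's, and the names ALT_TAG, ANY_TAG and strip_tags_regex are hypothetical. For patterns this simple I would expect the two engines to agree, but treat that as an assumption.

```python
import re

# First tag that carries an alt attribute; group 3 is the alt text.
ALT_TAG = re.compile(
    r"""<[^>]*?(\s alt \s* = \s* (['"]) ([^>]*?) \2) [^>]*? >""",
    re.VERBOSE,
)
# Anything that looks like a tag: "<", then characters up to the next ">".
ANY_TAG = re.compile(r"<[^>]*?>")


def strip_tags_regex(html):
    # Only the first alt-bearing tag is replaced, mirroring the inner
    # regexp_replace call, which has no 'g' flag.
    with_alt = ALT_TAG.sub(r"\3", html, count=1)
    return ANY_TAG.sub("", with_alt)


print(strip_tags_regex('<p>See <img src="x.png" alt="a cat"> here</p>'))
print(repr(strip_tags_regex('<a title="x > y">link</a>')))
```

The second call shows the weak spot: a ">" inside a quoted attribute ends the match early, so part of the attribute is left behind as text.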
See this and many other variations at siafoo.net, eskpee.wordpress, ... and here at Stack Overflow.
The choice is not limited to doing it server-side with a weak parser based on inadequate regexps or doing it client-side with a robust parser. It could be implemented server-side with a robust parser, too.
Here's an example in PL/PerlU that takes advantage of CPAN's HTML modules.
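CREATE FUNCTION extract_contents_from_html(text) RETURNS text AS $$
  use HTML::TreeBuilder;
  use HTML::FormatText;
  # Parse the HTML into a tree with a real parser instead of a regexp.
  my $tree = HTML::TreeBuilder->new;
  $tree->parse_content(shift);
  # Render the tree as plain text with a right margin of 78 columns.
  my $formatter = HTML::FormatText->new(leftmargin=>0, rightmargin=>78);
  my $text = $formatter->format($tree);
  $tree->delete;  # free the parse tree explicitly
  return $text;
$$ LANGUAGE plperlu;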
Demo:
select extract_contents_from_html('<html><body color="white">Hi there!<br>How are you?</body></html>') ;
Output:
 extract_contents_from_html
----------------------------
 Hi there! How are you?
One needs to be aware of the caveats that come with untrusted languages, though.
Any solution performed in the RDBMS is going to involve either string handling or regexes: to my knowledge there is NO way to manipulate HTML in a standards-compliant, safe way in the database. To reiterate, what you are asking for is very, VERY unsafe.
A much better option is to do this in your application. This is application logic, and NOT the job or concern of your storage layer.
A great way to do this (in PHP, at least) would be HTML Purifier. Don't do this in client-side JavaScript; the user can tamper with it very easily.
select regexp_replace(content, E'<[^>]+>', '', 'gi') from message;
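One caveat with a pattern like this is that it removes anything between a "<" and the next ">", whether or not it is a tag. A quick Python sketch of the same pattern shows the effect (the name naive_strip is hypothetical, and Python's re stands in for PostgreSQL's regex engine here, which I assume behaves the same for this pattern):

```python
import re


def naive_strip(text):
    # Same pattern as the SQL above: "<", one or more non-">" characters, ">".
    return re.sub(r"<[^>]+>", "", text)


print(naive_strip("<p>Hello</p>"))               # tags removed, text kept
print(repr(naive_strip("if a < b and c > d")))   # ordinary text is lost too
```

In the second call the comparison operators are treated as a tag, so "< b and c >" disappears from the text.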