How would you strip HTML tags in PostgreSQL such that the data inside the tags is preserved?
I found some solutions by googling, but they were stripping the text be
Feed your database with the XML datatype, not with "second-class" TEXT, because it is very simple to convert HTML into XHTML (see HTML-Tidy or the standard DOM's loadHTML() and saveXML() methods).
It is fast and it is very safe!
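As a rough illustration of why the HTML-to-XHTML step matters, the built-in xml_is_well_formed_document() function (PostgreSQL 9.1 and later, in a server built with libxml support) can tell whether a string is acceptable as an xml document. The two sample strings are made up for this sketch.

```sql
-- Sample strings are hypothetical; only the function call is PostgreSQL's own.
SELECT
    xml_is_well_formed_document('<p>line one<br>line two</p>')  AS raw_html,     -- unclosed <br>
    xml_is_well_formed_document('<p>line one<br/>line two</p>') AS tidied_xhtml; -- closed <br/>
```

The first column should come back false and the second true, which is the gap a tool such as HTML-Tidy closes before the data is stored.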
The common information retrieval need is not the full content but something inside the XHTML, so the power of xpath is welcome.
Example: retrieve all paragraphs with class="fn":
WITH needinfo AS (
    SELECT *, xpath('//p[@class="fn"]//text()', xhtml)::text[] AS frags
    FROM t
)
SELECT array_to_string(frags, ' ') AS my_p_fn2txt
FROM needinfo
WHERE array_length(frags, 1) > 0;
-- for full content use xpath('//text()', xhtml)
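To try the query above, a minimal setup might look like the following sketch. The table name t and the column xhtml come from the query; the id column, the temporary table, and the sample row are assumptions made up for illustration.

```sql
-- Hypothetical table matching the names used in the query above
CREATE TEMP TABLE t (
    id    serial PRIMARY KEY,  -- assumed key column, not in the original
    xhtml xml NOT NULL         -- well-formed XHTML stored as xml
);

-- Made-up sample document; XMLPARSE rejects input that is not well-formed
INSERT INTO t (xhtml) VALUES (
    XMLPARSE(DOCUMENT
      '<html><body><p class="fn">Jane <b>Doe</b></p><p>other text</p></body></html>')
);

-- Same XPath as above: text nodes of every <p class="fn">, joined with spaces
SELECT array_to_string(xpath('//p[@class="fn"]//text()', xhtml)::text[], ' ') AS my_p_fn2txt
FROM t;
```

Note that the sample document declares no namespace. If the stored XHTML carries the usual xmlns declaration, xpath() needs the namespace mapping passed as its optional third argument, and the element names in the expression must use the chosen prefix.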
I do not recommend the regex approach below, because it is not an "information retrieval" solution... and, as @James and others commented here, the regex solution is not so safe.
I like "pure SQL"; for me it is better than using Perl (see @Daniel's solution) or another language.
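CREATE OR REPLACE FUNCTION strip_tags(TEXT) RETURNS TEXT AS $$
    -- Inner call: replace the first tag that has an alt="..." attribute with its alt text.
    -- Outer call: remove every remaining tag.
    -- Backslashes are doubled because E'...' strings consume one level of escaping.
    SELECT regexp_replace(
        regexp_replace($1, E'(?x)<[^>]*?(\\s alt \\s* = \\s* ([\'"]) ([^>]*?) \\2) [^>]*? >', E'\\3'),
        E'(?x)(< [^>]*? >)', '', 'g')
$$ LANGUAGE SQL;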
See this and many other variations at siafoo.net, eskpee.wordpress, ... and here at Stackoverflow.
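As a quick check of the strip_tags function above, the following sketch runs it on two made-up inputs. It assumes the function has already been created in the current database.

```sql
-- Simple markup: the tags go away and the text between them is kept
SELECT strip_tags('<p>Hello <b>world</b>!</p>');

-- A ">" inside an attribute value: the pattern stops at the first ">",
-- so part of the attribute is left behind in the result
SELECT strip_tags('<a title="a > b">link</a>');
```

The first call should return Hello world!, while the second shows the kind of input where a regex-based approach becomes unreliable, which is the safety concern raised above.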