Stripping HTML tags in PostgreSQL

臣服心动 2020-12-06 02:30

How would you strip HTML tags in PostgreSQL such that the data inside the tags is preserved?

I found some solutions by googling, but they were stripping the text be

5 Answers
  •  时光取名叫无心
    2020-12-06 03:27

    Use xpath

    Store the content in a column of the XML data type rather than "second-class" TEXT. Converting HTML into XHTML first is very simple (see HTML-Tidy or the standard DOM loadHTML() and saveXML() methods).

    IT IS FAST AND VERY SAFE!
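    As a rough sketch of what storing the content as XML could look like: the table name t and column xhtml match the query in this answer, but the column layout and the sample rows are assumptions made for illustration. The built-in xml_is_well_formed_document() guards the conversion, so input that has not been tidied into XHTML is skipped instead of raising an error.

```sql
-- Hypothetical layout: the answer only mentions a table "t" with an "xhtml" column.
CREATE TEMP TABLE t (
    id    serial PRIMARY KEY,
    xhtml xml NOT NULL
);

-- Only rows that parse as a complete XML document are converted and stored.
INSERT INTO t (xhtml)
SELECT XMLPARSE(DOCUMENT src)
FROM (VALUES
    ('<html><body><p class="fn">Jane <b>Doe</b></p></body></html>'),
    ('<p>unclosed paragraph')   -- not well-formed, filtered out below
) AS raw(src)
WHERE xml_is_well_formed_document(src);
```

    The xml type and these functions require a PostgreSQL build with libxml support, which most packaged distributions appear to include.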

    The common information-retrieval need is not the full content but something inside the XHTML, so the power of XPath is welcome.

    Example: retrieve all paragraphs with class="fn":

      WITH needinfo AS (
        -- collect the text nodes of every <p class="fn"> as a text array
        SELECT *, xpath('//p[@class="fn"]//text()', xhtml)::text[] AS frags
        FROM t
      )
      SELECT array_to_string(frags, ' ') AS my_p_fn2txt
      FROM needinfo
      WHERE array_length(frags, 1) > 0;
      -- for full content use xpath('//text()', xhtml)
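    For the original question (drop every tag, keep all text), the full-content variant can be wrapped in a small function. This is a minimal sketch: the name xhtml_to_text is hypothetical, and it assumes the input is already well-formed XHTML with a single root element.

```sql
-- Hypothetical helper: concatenates every text node of the document.
CREATE OR REPLACE FUNCTION xhtml_to_text(doc xml) RETURNS text AS $$
    -- xpath() returns xml[]; each array element is one text node, in document order.
    SELECT array_to_string(xpath('//text()', doc)::text[], '')
$$ LANGUAGE SQL;

SELECT xhtml_to_text('<p>Hello <b>world</b>!</p>'::xml);
-- Hello world!
```

    Text nodes are joined with an empty string here so that inline markup does not introduce extra spaces. Entity references such as &amp; may remain escaped in the result, so a further unescaping step could be needed depending on the data.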
    

    regex solutions...

    I do not recommend them, because a regex is not an "information retrieval" solution, and, as @James and others commented here, the regex solution is not so safe.
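    To illustrate the safety concern, here is a sketch with a simplified tag-stripping pattern. The pattern and the sample strings are assumptions chosen for illustration, not any specific solution referenced in this thread.

```sql
-- Well-behaved markup: the tags are removed and the text stays.
SELECT regexp_replace('<p>Hello <b>world</b></p>', '<[^>]*>', '', 'g');
-- Hello world

-- A ">" inside an attribute value ends the match early,
-- so part of the attribute leaks into the result.
SELECT regexp_replace('<a title="x > y">link</a>', '<[^>]*>', '', 'g');
--  y">link
```

    An XML parser treats the quoted attribute value as a unit, which is why the xpath approach does not have this problem on well-formed input.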

    That said, I like "pure SQL"; for me it is better than using Perl (see @Daniel's solution) or another language.

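     CREATE OR REPLACE FUNCTION strip_tags(TEXT) RETURNS TEXT AS $$
         -- Inner call: replace the first tag that has an alt attribute with its alt text.
         -- Outer call: remove every remaining tag.
         -- Backslashes are doubled because E'...' strings treat "\" as an escape character.
         SELECT regexp_replace(
            regexp_replace($1, E'(?x)<[^>]*?(\\s alt \\s* = \\s* ([\'"]) ([^>]*?) \\2) [^>]*? >', E'\\3'), 
           E'(?x)(< [^>]*? >)', '', 'g')
     $$ LANGUAGE SQL;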
    

    See this and many other variations at siafoo.net, eskpee.wordpress, ... and here at Stack Overflow.
