Stripping HTML tags in PostgreSQL

臣服心动 2020-12-06 02:30

How would you strip HTML tags in PostgreSQL such that the data inside the tags is preserved?

I found some solutions by googling it but they were stripping the text be

5 answers
  •  南笙 (original poster)
    2020-12-06 03:28

    The choice is not limited to doing it server-side with a weak parser based on inadequate regexps or doing it client-side with a robust parser. It could be implemented server-side with a robust parser, too.
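
    For contrast, the regexp-based approach mentioned above might look like the sketch below. It is only an illustration of why such a parser is weak, not a recommended solution; the sample strings are made up for the example.

```sql
-- Naive tag stripping: delete everything between '<' and the next '>'.
SELECT regexp_replace('<p>Hi <b>there</b>!</p>', '<[^>]*>', '', 'g') AS stripped;
-- → Hi there!

-- A '>' inside an attribute value ends the match too early,
-- so part of the tag leaks into the result.
SELECT regexp_replace('<a title="a > b">link</a>', '<[^>]*>', '', 'g') AS stripped;
-- →  b">link
```

    Entities such as &amp; are also left undecoded by this approach, which is another reason to prefer a real HTML parser.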

    Here's an example in PL/PerlU that takes advantage of CPAN's HTML modules.

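    CREATE FUNCTION extract_contents_from_html(text) RETURNS text AS $$
      use strict;
      use HTML::TreeBuilder;
      use HTML::FormatText;

      # Parse the HTML passed as the first argument into a tree.
      my $tree = HTML::TreeBuilder->new;
      $tree->parse_content(shift);

      # Render the tree as plain text, wrapped at column 78.
      my $formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 78);
      my $text = $formatter->format($tree);

      $tree->delete;  # release the parse tree explicitly
      return $text;
    $$ LANGUAGE plperlu;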
    

    Demo:

    select extract_contents_from_html('Hi there!<br>How are you?');

    Output:

         extract_contents_from_html 
        ----------------------------
         Hi there!
         How are you?
    
    

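    One needs to be aware of the caveats that come with untrusted languages, though: only a superuser can create a PL/PerlU function, and its code runs with the operating-system privileges of the database server process.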
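
    One way to limit the exposure is to control who may call the function. The statements below are a minimal sketch under that assumption; the role name app_reader is hypothetical and stands for whatever role your application uses.

```sql
-- PL/PerlU is untrusted, so installing it requires superuser rights.
CREATE EXTENSION IF NOT EXISTS plperlu;

-- Functions are executable by PUBLIC by default; restrict this one.
REVOKE EXECUTE ON FUNCTION extract_contents_from_html(text) FROM PUBLIC;

-- app_reader is a hypothetical role used only for illustration.
GRANT EXECUTE ON FUNCTION extract_contents_from_html(text) TO app_reader;
```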
