Stripping HTML tags in PostgreSQL

臣服心动 2020-12-06 02:30

How would you strip HTML tags in PostgreSQL such that the data inside the tags is preserved?

I found some solutions by googling, but they were stripping the text be

5 Answers
  • 2020-12-06 03:24

    Don't do it in PostgreSQL.

    It is not designed to do this.

    Use PHP or whatever language you are using to serve webpages.

    Be careful with regular expressions, though. HTML is a complex language that cannot be described with regular expressions.

    Use a DOM parser to strip out tags.

    If you use regular expressions, there is no guarantee that nothing unsafe is left behind; you can also easily strip out more than you want, or leave malformed tags.
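
    As a rough sketch of the "use a real parser in the application" advice, here is a version in Python, assuming Python is the application language (the answer only names PHP). It uses the standard library's html.parser, which is an event-based HTML tokenizer rather than a full DOM parser; TextExtractor and strip_tags are hypothetical names made up for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and drops every tag (hypothetical helper)."""

    # Elements whose content is code or styling, not readable text.
    SKIPPED = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; in text.
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside <script> or <style>.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        return "".join(self._chunks)


def strip_tags(html):
    """Return the text content of an HTML fragment with all tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()


print(strip_tags('<a title="x > y">link</a> &amp; more'))  # prints: link & more
```

    Note that this sketch does not turn <br> or block elements into line breaks, so text from adjacent elements is simply concatenated. It only extracts text; it is not a sanitizer for HTML that will be rendered again.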

  • 2020-12-06 03:27

    Use xpath

    Store the content in the XML datatype rather than in "second-class" TEXT, because it is very simple to convert HTML into XHTML (see HTML-Tidy or the standard DOM loadHTML() and saveXML() methods).

    ! IT IS FAST AND IS VERY SAFE !

    The common information-retrieval need is not the full content but something inside the XHTML, so the power of XPath is welcome.

    Example: retrieve all paragraphs with class="fn":

      WITH needinfo AS (
        SELECT *, xpath('//p[@class="fn"]//text()', xhtml)::text[] as frags
        FROM t 
      ) SELECT array_to_string(frags,' ') AS my_p_fn2txt
        FROM needinfo
        WHERE array_length(frags , 1)>0
      -- for full content use xpath('//text()',xhtml)
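
    For comparison, the same kind of query can be sketched outside the database. This is only an illustration in Python's standard xml.etree.ElementTree, which supports a limited XPath subset; it assumes the input is already well-formed XHTML without a default namespace, and fn_paragraph_text is a hypothetical name.

```python
import xml.etree.ElementTree as ET


def fn_paragraph_text(xhtml):
    """Return the text of every <p class="fn"> in a well-formed XHTML string."""
    root = ET.fromstring(xhtml)
    # ElementTree understands the [@attr='value'] predicate; itertext()
    # walks all text nodes below the element, like //text() in XPath.
    return ["".join(p.itertext()) for p in root.findall(".//p[@class='fn']")]


doc = '<html><body><p class="fn">John <b>Smith</b></p><p>other</p></body></html>'
print(fn_paragraph_text(doc))  # prints: ['John Smith']
```

    Real XHTML usually declares the XHTML namespace, in which case the element names in the path would need to be namespace-qualified.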
    

    regex solutions...

    I do not recommend them, because regex is not an "information retrieval" solution and, as @James and others commented here, it is not very safe.

    I like "pure SQL"; for me it is better than using Perl (see @Daniel's solution) or another language.

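     CREATE OR REPLACE FUNCTION strip_tags(TEXT) RETURNS TEXT AS $$
         -- Inner call: replace the first tag carrying an alt attribute with its alt text.
         -- Outer call: remove every remaining tag.
         -- Backslashes are doubled because E'' strings consume one level of escaping.
         SELECT regexp_replace(
            regexp_replace($1, E'(?x)<[^>]*?(\\s alt \\s* = \\s* ([\'"]) ([^>]*?) \\2) [^>]*? >', E'\\3'),
           E'(?x)(< [^>]*? >)', '', 'g')
     $$ LANGUAGE SQL;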
    

    See this and many other variations at siafoo.net, eskpee.wordpress, ... and here at Stack Overflow.

  • 2020-12-06 03:28

    The choice is not limited to doing it server-side with a weak parser based on inadequate regexps or doing it client-side with a robust parser. It could be implemented server-side with a robust parser, too.

    Here's an example in PL/PerlU that takes advantage of CPAN's HTML modules.

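    CREATE FUNCTION extract_contents_from_html(text) returns text AS $$
      use HTML::TreeBuilder;
      use HTML::FormatText;
      # Parse the HTML passed as the first argument into a tree.
      my $tree = HTML::TreeBuilder->new;
      $tree->parse_content(shift);
      # Render the tree as plain text wrapped at 78 columns.
      my $formatter = HTML::FormatText->new(leftmargin=>0, rightmargin=>78);
      my $text = $formatter->format($tree);
      $tree->delete;  # free the parse tree
      return $text;
    $$ LANGUAGE plperlu;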
    

    Demo:

    select extract_contents_from_html('<html><body color="white">Hi there!<br>How are you?</body></html>') ;
    

    Output:

         extract_contents_from_html 
        ----------------------------
         Hi there!
         How are you?
    
    

    One needs to be aware of the caveats that come with untrusted languages, though.

  • 2020-12-06 03:29

    Any solution performed in the RDBMS is going to involve either string handling or regexes: to my knowledge there is NO way to manipulate HTML in a standards-compliant, safe way in the database. To reiterate, what you are asking for is very, VERY unsafe.

    A much better option is to do this in your application. This is application logic, and NOT the job or concern of your storage layer.

    A great way to do this (in PHP, at least) would be HTML Purifier. Don't do this in JavaScript; the user can tamper with it very easily.

  • 2020-12-06 03:29
    select regexp_replace(content, E'<[^>]+>', '', 'gi') from message;
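
    To see where a pattern like this one works and where it breaks, here is a small sketch that applies the same expression with Python's re module. It assumes the pattern behaves the same way in Python as in PostgreSQL for these inputs; regex_strip is a hypothetical name.

```python
import re

# Same pattern as in the query above: "<", one or more non-">" characters, ">".
TAG_PATTERN = re.compile(r"<[^>]+>", re.IGNORECASE)


def regex_strip(text):
    """Remove everything that looks like a tag to the naive pattern."""
    return TAG_PATTERN.sub("", text)


# Simple markup works as expected.
print(repr(regex_strip("<p>Hello <b>world</b></p>")))    # 'Hello world'

# Plain text containing "<" and ">" loses the part in between.
print(repr(regex_strip("if a < b and b > c")))           # 'if a  c'

# A ">" inside an attribute value ends the match early and leaves debris.
print(repr(regex_strip('<a title="x > y">link</a>')))    # ' y">link'
```

    The last two cases are examples of stripping more than wanted and of leaving a malformed remainder.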
    