PostgreSQL - Replace HTML Entities

闹比i 2020-12-11 20:36

I have just started stripping HTML entities out of our database. We do a lot of crawling, and some of the crawlers didn't do this at input time :(

3 Answers
  • 2020-12-11 21:04

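    You could use xpath, because HTML and XML encode the predefined entities (&amp;, &lt;, &gt;, &quot;, &apos;) and numeric character references the same way. Note that HTML-only named entities such as &hellip; are not defined in XML, so the cast to xml fails on them. For example: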

    select 
      'AT&amp;T' as input,
      (xpath('/z/text()', ('<z>' || 'AT&amp;T' || '</z>')::xml))[1] as output 
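
The same wrap-and-parse idea can be tried outside the database. The following is a minimal Python sketch, not PostgreSQL's implementation: it uses the standard library's xml.etree.ElementTree, and the helper names decode_via_xml and try_decode are hypothetical. It shows that the predefined XML entities and numeric references decode, while an HTML-only name such as &hellip; is rejected by the XML parser.

```python
import xml.etree.ElementTree as ET


def decode_via_xml(fragment: str) -> str:
    """Decode entities by parsing the fragment as the text of a dummy <z> element."""
    # The fragment must be well-formed XML text: a bare '&' or '<' raises ParseError.
    # Only the text before any child element is returned.
    root = ET.fromstring("<z>" + fragment + "</z>")
    return root.text or ""


def try_decode(fragment: str) -> str:
    """Return the decoded text, or the fragment unchanged if it is not valid XML."""
    try:
        return decode_via_xml(fragment)
    except ET.ParseError:
        return fragment


print(decode_via_xml("AT&amp;T"))  # predefined XML entity
print(decode_via_xml("&#8230;"))   # numeric character reference
print(try_decode("&hellip;"))      # HTML-only name, left unchanged
```

If the crawled data contains HTML-only named entities, one of the procedural-language approaches is likely the safer choice.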
    
  • 2020-12-11 21:10

    This is what it took to get this working on Ubuntu 18.04 with PostgreSQL 10. Perl didn't decode some entities, such as &comma;, for some reason, so I used Python 3.

    From the command line:

    sudo apt install postgresql-plpython3-10
    

    From your SQL interface:

    CREATE LANGUAGE plpython3u;
    
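    CREATE OR REPLACE FUNCTION htmlchars(str TEXT) RETURNS TEXT AS $$
        # html.unescape() replaces HTMLParser().unescape(),
        # which was removed in Python 3.9.
        import html
        # SQL NULL arrives as None; pass it through unchanged.
        if str is None:
            return None
        return html.unescape(str)
    $$ LANGUAGE plpython3u;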
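
The function body can be tried outside the database with plain Python. The sketch below assumes Python 3.4 or later and uses html.unescape from the standard library, the documented replacement for the HTMLParser().unescape method (removed in Python 3.9). The sample strings are made up for illustration, and the Python-level htmlchars only mirrors the SQL function of the same name.

```python
import html
from typing import Optional


def htmlchars(value: Optional[str]) -> Optional[str]:
    """Mirror of the SQL function body: pass None (SQL NULL) through, decode the rest."""
    if value is None:
        return None
    return html.unescape(value)


# Illustrative inputs only; they are not taken from a real crawl.
samples = ["AT&amp;T", "a&comma; b", "&hellip;", "&#39;quoted&#39;", None]
for sample in samples:
    print(repr(sample), "->", repr(htmlchars(sample)))
```

Because html.unescape knows the HTML5 named character references, it handles names such as &comma; as well as numeric references.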
    
  • 2020-12-11 21:20

    Write a function in PL/PerlU that uses the HTML::Entities module: https://metacpan.org/pod/HTML::Entities

    Of course, you need to have Perl installed and PL/Perl available.

    1) First, create the procedural language PL/PerlU:

    CREATE EXTENSION plperlu;
    

    2) Then create a function like this:

    CREATE FUNCTION decode_html_entities(text) RETURNS TEXT AS $$
        use HTML::Entities;
        return decode_entities($_[0]);
    $$ LANGUAGE plperlu;
    

    3) Then you can use it like this:

    select decode_html_entities('aaabbb&amp;.... asasdasdasd &hellip;');
       decode_html_entities    
    ---------------------------
     aaabbb&.... asasdasdasd …
    (1 row)
    