I have just set about the task of stripping out HTML entities from our database, as we do a lot of crawling and some of the crawlers didn't do this at input time :(
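You could use xpath. The entities HTML shares with XML (&amp;amp;, &amp;lt;, &amp;gt;, &amp;quot;, &amp;apos; and numeric character references) are encoded the same way in both, but HTML-only named entities such as &amp;nbsp; are not valid XML and will raise an error on the cast: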
select
    'AT&amp;T' as input,
    (xpath('/z/text()', ('<z>' || 'AT&amp;T' || '</z>')::xml))[1] as output;
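To see where the XML route stops working, here is a rough analogy in plain Python. It uses Python's own XML parser rather than the libxml2 that PostgreSQL uses, so treat it only as an illustration of the XML rules; `decode_via_xml` is a made-up helper name, not anything from PostgreSQL.

```python
import xml.etree.ElementTree as ET
from html import unescape


def decode_via_xml(fragment):
    """Wrap the fragment in a dummy <z> element and read back its text."""
    return ET.fromstring("<z>" + fragment + "</z>").text


# Entities that XML itself defines, and numeric references, decode fine.
print(decode_via_xml("AT&amp;T"))
print(decode_via_xml("caf&#233;"))

# A named entity that only HTML defines is rejected by the XML parser.
try:
    decode_via_xml("a&nbsp;b")
except ET.ParseError as exc:
    print("XML parser rejected it:", exc)

# A real HTML decoder handles it (nbsp becomes U+00A0).
print(unescape("a&nbsp;b") == "a\u00a0b")
```

If your crawled data contains named HTML entities beyond the five XML ones, an HTML-aware decoder is probably the safer choice.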
This is what it took for me to get this working on Ubuntu 18.04 with PG10. Perl didn't decode some entities, like &amp;comma;, for some reason, so I used Python 3.
From the command line
sudo apt install postgresql-plpython3-10
From your SQL interface:
CREATE LANGUAGE plpython3u;
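CREATE OR REPLACE FUNCTION htmlchars(str TEXT) RETURNS TEXT AS $$
    # html.unescape (Python 3.4+) replaces HTMLParser().unescape,
    # which was deprecated and then removed in Python 3.9.
    from html import unescape
    # Pass SQL NULL straight through.
    if str is None:
        return None
    return unescape(str)
$$ LANGUAGE plpython3u;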
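Before installing the function, it can help to try the same logic in a plain Python 3 shell. This is a minimal sketch that assumes the function body boils down to `html.unescape` plus a NULL check; the sample strings are made up for illustration.

```python
from html import unescape


def htmlchars(value):
    """Same logic as the PL/Python body: None (SQL NULL) passes through."""
    if value is None:
        return None
    return unescape(value)


# Named, decimal and hexadecimal references all go through the same call.
samples = ["AT&amp;T", "wait&hellip;", "wait&#8230;", "wait&#x2026;", "a&comma; b", None]
for sample in samples:
    print(repr(sample), "->", repr(htmlchars(sample)))
```

Strings without any entities come back unchanged.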
Write a function using pl/perlu and use this module https://metacpan.org/pod/HTML::Entities
Of course you need to have perl installed and pl/perl available.
1) First of all create the procedural language pl/perlu:
CREATE EXTENSION plperlu;
2) Then create a function like this:
CREATE FUNCTION decode_html_entities(text) RETURNS TEXT AS $$
use HTML::Entities;
return decode_entities($_[0]);
$$ LANGUAGE plperlu;
3) Then you can use it like this:
select decode_html_entities('aaabbb&amp;.... asasdasdasd &hellip;');
decode_html_entities
---------------------------
aaabbb&.... asasdasdasd …
(1 row)