Stripping HTML tags in PostgreSQL

臣服心动 2020-12-06 02:30

How would you strip HTML tags in PostgreSQL such that the data inside the tags is preserved?

I found some solutions by googling, but they were stripping the text be

5 answers

时光取名叫无心 (original poster)

2020-12-06 03:27
Use xpath

Store the content in your database as the XML datatype, not as "second-class" TEXT, because it is very simple to convert HTML into XHTML (see HTML-Tidy or the standard DOM loadHTML() and saveXML() methods).

IT IS FAST AND VERY SAFE!
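
To make the storage idea concrete, here is a minimal sketch. The table name `t` and column name `xhtml` are taken from the query further down; the `id` column and the sample document are made up for illustration. It assumes a PostgreSQL build with libxml support and input that has already been converted to well-formed XHTML.

```sql
-- Hypothetical table: only the names t and xhtml come from the answer.
CREATE TABLE t (
    id    serial PRIMARY KEY,
    xhtml xml NOT NULL
);

-- XMLPARSE rejects input that is not well-formed, so broken markup
-- is caught at insert time instead of at query time.
INSERT INTO t (xhtml) VALUES
  (XMLPARSE(DOCUMENT
    '<html><body><p class="fn">Jane <b>Doe</b></p><p>Other text</p></body></html>'));
```

One caveat: if the converter emits a default namespace such as `xmlns="http://www.w3.org/1999/xhtml"`, plain paths like `//p` will not match, and `xpath()` needs a namespace mapping passed as its third argument.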

The common information-retrieval need is not the full content but something inside the XHTML, so the power of xpath is welcome.

Example: retrieve all paragraphs with class="fn":
```
  WITH needinfo AS (
    SELECT *, xpath('//p[@class="fn"]//text()', xhtml)::text[] as frags
    FROM t 
  ) SELECT array_to_string(frags,' ') AS my_p_fn2txt
    FROM needinfo
    WHERE array_length(frags , 1)>0
  -- for full content use xpath('//text()',xhtml)
```
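
For the original question (strip every tag, keep the text), the `//text()` variant mentioned in the comment could look roughly like this. It is a sketch only: the inline `t(xhtml)` row is sample data so the statement runs on its own, and `plain_text` is an arbitrary alias.

```sql
-- Inline sample row standing in for the real table t.
WITH t(xhtml) AS (
    VALUES (XMLPARSE(DOCUMENT
      '<html><body><p class="fn">Jane <b>Doe</b></p><p>Other text</p></body></html>'))
)
-- Collect every text node in document order and join them into one string.
SELECT array_to_string(xpath('//text()', xhtml)::text[], ' ') AS plain_text
FROM t;
```

As far as I know, text nodes cast this way keep their XML escaping (for example `&amp;` stays `&amp;`), so a decoding step may still be needed if the plain text is shown to users.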
regex solutions...

I do not recommend them, because they are not an "information retrieval" solution, and, as @James and others commented here, the regex solution is not so safe.

I like "pure SQL"; for me it is better than using Perl (see @Daniel's solution) or another language.
```
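 -- First replace each tag that carries an alt attribute with its alt text, then drop all remaining tags.
 -- Backslashes are doubled because E'' strings consume single backslashes before the regex engine sees them.
 CREATE OR REPLACE FUNCTION strip_tags(TEXT) RETURNS TEXT AS $$
     SELECT regexp_replace(
        regexp_replace($1, E'(?x)<[^>]*?(\\s alt \\s* = \\s* ([\'"]) ([^>]*?) \\2) [^>]*? >', E'\\3', 'g'),
        E'(?x)(< [^>]*? >)', '', 'g')
 $$ LANGUAGE SQL;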
```
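
A quick way to try the function, as a sketch with made-up sample strings. The second call illustrates the kind of input the safety warning above is about.

```sql
-- Assumes strip_tags(text) from the answer above has already been created.
SELECT strip_tags('<p>Hello <b>world</b></p>') AS plain;
-- expected: Hello world

-- A '>' inside an attribute value ends the tag match early,
-- so a fragment of the attribute leaks into the result.
SELECT strip_tags('<a title="a > b">link</a>') AS leaky;
```

This is the main reason to prefer the xpath route when the markup can be stored as well-formed XML: a parser knows where a tag really ends, a pattern like `<[^>]*?>` only guesses.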
See this and many other variations at siafoo.net, eskpee.wordpress, ... and here at Stack Overflow.