Stripping HTML tags in PostgreSQL


臣服心动

How would you strip HTML tags in PostgreSQL such that the data inside the tags is preserved?

I found some solutions by googling, but they were stripping the text be


5 answers

野的像风

2020-12-06 03:24

Don't do it in PostgreSQL.

It is not designed to do this.

Use PHP or whatever language you are using to serve webpages.

Be careful with regular expressions, though. HTML is a complex language that cannot be described with regular expressions.

Use a DOM parser to strip out tags.

If you use regular expressions, there is no guarantee that nothing unsafe is left behind; you can also easily strip out more than you want, or leave malformed tags.
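As a rough illustration of the parser-based approach, here is a minimal Python sketch that uses the standard library's `html.parser` to keep the text and drop the tags. Python is only a stand-in for whatever language serves your pages, and the names `TextExtractor` and `strip_tags` are made up for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and ignores every tag (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; before handle_data
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for the text found between tags
        self.parts.append(data)


def strip_tags(html):
    """Return the text content of an HTML fragment with all tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


print(strip_tags("<p>Hello <b>world</b></p>"))
```

This sketch also keeps the contents of `<script>` and `<style>` elements as text, so treat it as a starting point rather than a sanitizer.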

时光取名叫无心

2020-12-06 03:27
Use xpath

Store the content in the XML datatype rather than in "second-class" TEXT, because it is very simple to convert HTML into XHTML (see HTML-Tidy, or the standard DOM loadHTML() and saveXML() methods).

This is fast and very safe.

The common information-retrieval need is usually not the full content but something inside the XHTML, so the power of XPath is welcome.

Example: retrieve all paragraphs with class="fn":
```
  WITH needinfo AS (
    -- collect every text node under <p class="fn"> as a text[] array
    SELECT *, xpath('//p[@class="fn"]//text()', xhtml)::text[] AS frags
    FROM t
  )
  SELECT array_to_string(frags, ' ') AS my_p_fn2txt
  FROM needinfo
  WHERE array_length(frags, 1) > 0;
  -- for full content use xpath('//text()', xhtml)
```
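To make the XPath idea concrete outside the database, the sketch below runs a comparable query with Python's standard `xml.etree.ElementTree`. It assumes well-formed XHTML without a default namespace. The function name `fn_text` and the sample markup are invented for illustration, and unlike the SQL above it trims whitespace from each fragment before joining.

```python
import xml.etree.ElementTree as ET


def fn_text(xhtml):
    """Join the text fragments of every <p class="fn"> with single spaces."""
    root = ET.fromstring(xhtml)
    frags = []
    # ElementTree supports a limited XPath subset, including attribute predicates
    for p in root.iterfind(".//p[@class='fn']"):
        for text in p.itertext():  # text of p and of all nested elements
            text = text.strip()
            if text:
                frags.append(text)
    return " ".join(frags)


sample = (
    '<html><body>'
    '<p class="fn">John <b>Doe</b></p>'
    '<p>other</p>'
    '<p class="fn">Jane</p>'
    '</body></html>'
)
print(fn_text(sample))
```

If the document declares the XHTML namespace, the element names in the path need to be namespace-qualified, both here and in PostgreSQL's `xpath()`.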
regex solutions...

I do not recommend it, because it is not an "information retrieval" solution and, as @James and others commented here, the regex solution is not so safe.

I like "pure SQL"; for me it is better than using Perl (see @Daniel's solution) or another language.
```
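 CREATE OR REPLACE FUNCTION strip_tags(TEXT) RETURNS TEXT AS $$
     -- Backslashes are doubled because E'...' strings consume single ones
     SELECT regexp_replace(
        -- inner call: replace the first tag carrying an alt attribute with its alt text
        regexp_replace($1, E'(?x)<[^>]*?(\\s alt \\s* = \\s* ([\'"]) ([^>]*?) \\2) [^>]*? >', E'\\3'),
        -- outer call: remove every remaining tag
       E'(?x)(< [^>]*? >)', '', 'g')
 $$ LANGUAGE SQL;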
```
See this and many other variations at siafoo.net, eskpee.wordpress, ... and here at Stack Overflow.

南笙

2020-12-06 03:28

The choice is not limited to doing it server-side with a weak parser based on inadequate regexps or doing it client-side with a robust parser. It could be implemented server-side with a robust parser, too.

Here's an example in PL/PerlU that takes advantage of the CPAN's HTML modules.

```
CREATE FUNCTION extract_contents_from_html(text) RETURNS text AS $$
  use HTML::TreeBuilder;
  use HTML::FormatText;
  # Parse the HTML argument into a tree
  my $tree = HTML::TreeBuilder->new;
  $tree->parse_content(shift);
  # Render the tree as plain text, wrapped at column 78
  my $formatter = HTML::FormatText->new(leftmargin=>0, rightmargin=>78);
  my $text = $formatter->format($tree);
  return $text;
$$ LANGUAGE plperlu;
```

Demo:

```
select extract_contents_from_html('<html><body color="white">Hi there!<br>How are you?</body></html>');
```

Output:

     extract_contents_from_html 
    ----------------------------
     Hi there!
     How are you?

One needs to be aware of the caveats that come with untrusted languages, though.
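For comparison, a similar line-aware extraction can be sketched in application code. The Python example below is not a port of HTML::FormatText. It only starts a new line at a handful of tags, and both the tag list and the names `LineAwareExtractor` and `html_to_text` are assumptions made for this illustration.

```python
from html.parser import HTMLParser

# Tags that start a new line in the plain-text output (an illustrative choice)
LINE_BREAK_TAGS = {"br", "p", "div", "li", "tr", "h1", "h2", "h3"}


class LineAwareExtractor(HTMLParser):
    """Collects text and inserts a newline at selected tags (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased from HTMLParser
        if tag in LINE_BREAK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html):
    """Return the text of an HTML document, one non-empty line per block."""
    parser = LineAwareExtractor()
    parser.feed(html)
    parser.close()
    lines = (line.strip() for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)


print(html_to_text('<html><body color="white">Hi there!<br>How are you?</body></html>'))
```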


庸人自扰

2020-12-06 03:29

Any solution performed in the RDBMS is going to involve either string handling or regexes: to my knowledge there is NO way to manipulate HTML in a standards-compliant, safe way in the database. To reiterate, what you are asking for is very, VERY unsafe.

A much better option is to do this in your application. This is application logic, and NOT the job or concern of your storage layer.

A great way to do this (in PHP, at least) would be HTML Purifier. Don't do this in JavaScript, because the user can tamper with it very easily.

别那么骄傲

2020-12-06 03:29
```
select regexp_replace(content, E'<[^>]+>', '', 'gi') from message;
```
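To see what this pattern does on a few inputs, the Python sketch below applies the same expression, `<[^>]+>`, with the standard `re` module. It is a stand-in for the PostgreSQL call, not a test of `regexp_replace` itself, and the helper name `regex_strip` is invented here.

```python
import re

# Same pattern as in the query above: "<", then one or more non-">" characters, then ">"
TAG_RE = re.compile(r"<[^>]+>")


def regex_strip(html):
    """Remove everything that looks like a tag to the simple pattern."""
    return TAG_RE.sub("", html)


# Simple markup is handled as expected
print(repr(regex_strip("<p>Hello <b>world</b></p>")))

# A ">" inside a quoted attribute ends the match early and leaves debris
print(repr(regex_strip('<img alt="a > b">after')))

# Plain text containing "<" and ">" loses the part in between
print(repr(regex_strip("if a < b and c > d")))
```

The last two cases are the kind of over-stripping and leftover fragments that other answers in this thread warn about.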