PostgreSQL various clean up of string \ varchar

牧云@^-^@ 提交于 2019-12-10 15:43:26

问题


I have to clean up some varchar in the following manner:

  1. Remove special characters such as: !, @, #, $, %, ^, &, *, (, ), }, {, [, ], ",", ., ?, /, ', from a closed list. I've managed to do so with a mass use of replace\regexp_replace but I'm looking for something similar to the one in SQL server.

  2. Remove following numbers but not adjacent ones meaning:

    round 1 --> round
    round1 --> round1
    round 12345 --> round
    round12345 --> round12345

  3. Remove words out of a closed list of words such as: "and", "or", "Ltd.", "international" without substrings:

    more food or drinks ---> more food drinks. and not --> me food or drinks

I'm working on a big table and I'd like to do it as efficient as possible.
Should I write a function that does that or is there a more elegant way?


回答1:


I'll play along. Here's for question 2:

SELECT trim(regexp_replace(
   '12 foo1 run 1457 ABC 1Foo 2456 bar 34',
   '\s*\m\d+\M\s*',
   ' ',
   'g'
   ));

Returns:

foo1 run ABC 1Foo bar

I updated the answer to use constraint expressions instead of the bracket expressions, because the manual informs:

The constraint escapes described below are usually preferable; they are no more standard, but are easier to type.

\s* .. zero or more white space
\m .. beginning of a word (same as [[:<:]])
\d+ .. 1 or more digits
\M .. end of a word (same as [[:>:]])

The 4th parameter 'g' is needed to replace "globally", not just the first occurrence.

->sqlfiddle for v9.2
->sqlfiddle for v8.4 doesn't work

Why?

standard_conforming_strings. The default changed with v9.1.

This works in both worlds, "compatibility mode" so to say. But the syntax of the modern version above (in combination with standard_conforming_strings = on) is cleaner.

SELECT trim(regexp_replace(
  '12 foo1 run 1457 ABC 1Foo 2456 bar 34',
  E'\\s*\\m\\d+\\M\\s*',
  ' ',
  'g'
));

->sqlfiddle




回答2:


1. It is much better to replace character which are not in allowed list with spaces, like this:

select regexp_replace(
    E'aśx€ ąsd, dsa w|adwf\n  as dw dgaa[aw] asdd',
    '[^a-zA-Z0-9]',
    ' ',
    'g');

This returns

a x   sd  dsa w adwf   as dw dgaa aw  asdd

There are thousands of possible characters in Unicode — it isn't really possible to list all special characters.

Taking out multiple consecutive spaces left as exercise to the reader.




回答3:


3. I think the fastest way would be:

select regexp_replace(
  'And more food or drinks at the international airport Ltd',
  '[[:<:]](and|or|Ltd|international)[[:>:]]',
  ' ',
  'gi'
);

This returns:

 more food  drinks at the  airport

I assume that Ltd. really is Ltd, as a dot is filtered out in first answer.



来源:https://stackoverflow.com/questions/14412635/postgresql-various-clean-up-of-string-varchar

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!