I have a database with several thousand records, and I need to strip down one of the fields to ensure that it only contains certain characters (Alphanumeric, spaces, and sin
I have created simple function for this
DROP FUNCTION IF EXISTS `regex_replace`$$
CREATE FUNCTION `regex_replace`(pattern VARCHAR(1000),replacement VARCHAR(1000),original VARCHAR(1000)) RETURNS VARCHAR(1000) CHARSET utf8mb4
DETERMINISTIC
BEGIN
DECLARE temp VARCHAR(1000);
DECLARE ch VARCHAR(1);
DECLARE i INT;
SET i = 1;
SET temp = '';
IF original REGEXP pattern THEN
loop_label: LOOP
IF i>CHAR_LENGTH(original) THEN
LEAVE loop_label;
END IF;
SET ch = SUBSTRING(original,i,1);
IF NOT ch REGEXP pattern THEN
SET temp = CONCAT(temp,ch);
ELSE
SET temp = CONCAT(temp,replacement);
END IF;
SET i=i+1;
END LOOP;
ELSE
SET temp = original;
END IF;
RETURN temp;
END
Usage example:
SELECT <field-name> AS NormalText, regex_replace('[^A-Za-z0-9 ]', '', <field-name>)AS RegexText FROM
<table-name>
update mytable
set FieldName = REPLACE(FieldName,'/','')
That's a good place to start.
The Replace() function is first choice. However, Special Characters can sometimes be tricky to write in a console. For those you can combine Replace with the Char() function.
e.g. removing €
Update products set description = replace(description, char(128), '');
You can find all the Ascii values here
Ideally you could do a regex to find all the special chars, but apparently that's not possible with MySQL.
Beyond that, you'd need to run it through your favorite scripting language.
This might be useful.
This solution doesn't involves creating procedures or functions or lengthy use of replace within replace. Instead we know that all the ASCII characters that doesn't involves special character lies within ASCII codes \x20-\x7E (Hex representation). Source ASCII From Wikipedia, the free encyclopedia Below are all those characters in that interval.
Hex: 20 21 22 23 24 25 26 27 28 29 2A 2B 2C 2D 2E 2F 30 31 32 33 34 35 36 37 38 39 3A 3B 3C 3D 3E 3F 40 41 42 43 44 45 46 47 48 49 4A 4B 4C 4D 4E 4F 50 51 52 53 54 55 56 57 58 59 5A 5B 5C 5D 5E 5F 60 61 62 63 64 65 66 67 68 69 6A 6B 6C 6D 6E 6F 70 71 72 73 74 75 76 77 78 79 7A 7B 7C 7D 7E
Glyph: space ! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; < = > ? @ A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ \ ] ↑ ← @ a b c d e f g h i j k l m n o p q r s t u v w x y z { ACK } ESC
so as simple regular expression replace will do the job
SELECT REGEXP_REPLACE(columnName, '[^\\x20-\\x7E]', '') from tableName;
PHP Custom query string
$query = "select REGEXP_REPLACE(columnName, '(.*)[(].*[)](.*)', CONCAT('\\\\1', '\\\\2')) `Alias` FROM table_Name";
The above statement replaces the content in between brackets as well as brackets. e.g. if the column contains 'Staff Oreintation (CMST TOT)' then above statement will removes the brackets and its contant i.e. 'Staff Oreintation'.
PS: If you are doing any DML (select, update ...) operation using prepare statement in stored procedure OR through PHP (creating a custom query string); then remember to escape the slash i.e.
SET @sql = CONCAT("SELECT REGEXP_REPLACE(columnName, '[^\\\\x20-\\\\x7E]', '') from tableName");
PREPARE stmt FROM @sql;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;
The above SQL statement does a simple regular expression replaces (actually removes) of all the special character; i.e. In the SQL a REGEX pattern is mention of all the special characters to be replaced with nothing.
Explanation of the pattern
A character group is started with the square bracket. The first character is caret which means; negation of all the characters mention in the group (i.e. with in the squares brackets). This simply means select compliment (other character than those selected) of all the characters in the group.
Just to summarize the above statement will
Unchanged: all the alphanumeric characters, punctuation characters, arithmetic operators.
Remove all the Unicode characters (other than Latin alphabets) or special characters.
Have a look at LIB_MYSQLUDF_PREG which needs to be compiled into the MySQL server, but has advanced regular expression facilities such as preg_replace
which will help with your task.
This may also be useful.
First you have to know the character set of the database and / or of the table. For example, let us suppose you have a UTF-8 environment and you want to strip / remove symbols like circled registered symbols, circled copyright symbol, and registered trademark symbol from a field then search the internet via bing or yahoo or google for for the hex code values of these symbols in the UTF-8 system:
Symbol Utf-8 Hex ======= ========= circled copyright C2A9 circled registered C2AE Trademark (i.e., TM) E284A2
Then your scrubbing select sql for field f1 from table t1, using the hex / unhex facility in conjunction with the replace function, will most likely look like this:
SELECT
cast(unhex(replace(replace(replace(hex(f1),'C2A9',''),'C2AE',''),'E284A2','')) AS char) AS cleanf1
FROM t1
;
Above, note the original field to be scrubbed / cleansed is f1, the table is t1 and the output header is cleanf1. The "as char" casting is necessary because, w/o it, the mysql 5.5.8 on which I tested is returning blob. Hope this helps