I am working for international clients who all use very different alphabets, so I am finally trying to get an overview of a complete workflow between PHP and MySQL that ensures text in any character encoding is inserted correctly. I have read a bunch of tutorials on this but still have questions (there is much to learn), so I thought I would put it all together here and ask.
PHP
header('Content-Type:text/html; charset=UTF-8');
mb_internal_encoding('UTF-8');
HTML
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
<form accept-charset="UTF-8"> .. </form>
(though the latter is optional and merely a suggestion, I believe I'd rather suggest than do nothing)
MySQL
CREATE DATABASE database_name DEFAULT CHARACTER SET utf8; or ALTER DATABASE database_name DEFAULT CHARACTER SET utf8; and/or use utf8_general_ci as the MySQL connection collation.
(it is important to note here that this will increase the database size if it uses varchar)
Connection
mysql_query("SET NAMES 'utf8'");
mysql_query("SET CHARACTER SET utf8");
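A minimal sketch of the same connection step without the mysql_* functions, which were removed in PHP 7.0. It assumes PDO with the MySQL driver, whose DSN has accepted a charset parameter since PHP 5.3.6. The helper name build_mysql_dsn and the host and database values are hypothetical, not part of the original post.

```php
<?php
// Hypothetical helper: builds a PDO MySQL DSN that declares the
// connection character set up front, replacing a separate SET NAMES query.
function build_mysql_dsn(string $host, string $dbname, string $charset = 'utf8'): string
{
    return sprintf('mysql:host=%s;dbname=%s;charset=%s', $host, $dbname, $charset);
}

// Placeholder host and database name, for illustration only.
$dsn = build_mysql_dsn('localhost', 'database_name');

// Connecting needs a running server and real credentials, so it is left commented out:
// $pdo = new PDO($dsn, $user, $password);
```

Passing the charset as an argument keeps the choice in one place, so every connection in the application asks for the same encoding.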
Business logic
Detect whether input is not UTF-8 with mb_detect_encoding() and convert it with iconv().
Validate overlong sequences of UTF-8 and UTF-16:
$body=preg_replace('/[\x00-\x08\x10\x0B\x0C\x0E-\x19\x7F]|(?<=^|[\x00-\x7F])[\x80-\xBF]+|([\xC0\xC1]|[\xF0-\xFF])[\x80-\xBF]*|[\xC2-\xDF]((?![\x80-\xBF])|[\x80-\xBF]{2,})|[\xE0-\xEF](([\x80-\xBF](?![\x80-\xBF]))|(?![\x80-\xBF]{2})|[\x80-\xBF]{3,})/','�',$body);
$body=preg_replace('/\xE0[\x80-\x9F][\x80-\xBF]|\xED[\xA0-\xBF][\x80-\xBF]/S','?', $body);
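The conversion step mentioned above could look roughly like the following sketch. It assumes the source encoding is already known (ISO-8859-1 here, purely as an example) rather than detected, and that the iconv extension is available. The function name to_utf8 is hypothetical.

```php
<?php
// Hypothetical helper: converts text from a known source encoding to UTF-8.
function to_utf8(string $text, string $sourceEncoding): string
{
    $converted = iconv($sourceEncoding, 'UTF-8', $text);
    if ($converted === false) {
        // iconv() returns false when the input cannot be converted.
        throw new InvalidArgumentException("Cannot convert from $sourceEncoding to UTF-8");
    }
    return $converted;
}

$latin1 = "caf\xE9";                       // "café" encoded as ISO-8859-1 (4 bytes)
$utf8   = to_utf8($latin1, 'ISO-8859-1');  // "café" encoded as UTF-8 (5 bytes)
```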
Questions
- Is mb_internal_encoding('UTF-8') necessary in PHP 5.3 and higher, and if so, does this mean I have to use the multibyte functions instead of the core functions, like mb_substr() instead of substr()?
- Is it still necessary to check for malformed input strings, and if so, what is a reliable function/class to do so? I would rather not strip bad data, and I don't know enough about transliteration.
- Should it really be utf8_general_ci or rather utf8_bin?
- Is there something missing in the above workflow?
sources:
http://coding.smashingmagazine.com/2012/06/06/all-about-unicode-utf8-character-sets/
http://webcollab.sourceforge.net/unicode.html
http://stackoverflow.com/a/3742879/1043231
http://www.adayinthelifeof.nl/2010/12/04/about-using-utf-8-fields-in-mysql/
http://akrabat.com/php/utf8-php-and-mysql/
- mb_internal_encoding('UTF-8') doesn't do anything by itself; it only sets the default encoding parameter for each mb_ function. If you're not using any mb_ function, it doesn't make any difference. If you are, it makes sense to set it so you don't have to pass the $encoding parameter each time individually.
- IMO mb_detect_encoding is mostly useless, since it's fundamentally impossible to accurately detect the encoding of unknown text. You should either know what encoding a blob of text is in because you have a specification about it, or you need to parse appropriate metadata like headers or meta tags where the encoding is specified.
- Using mb_check_encoding to check if a blob of text is valid in the encoding you expect it to be in is typically sufficient. If it's not, discard it and throw an appropriate error.
- Regarding:
  "does this mean I have to use all multi byte functions instead of its core functions"
  If you are manipulating strings that contain multibyte characters, then yes, you need to use the mb_ functions to avoid getting wrong results. The core string functions only work on a byte level, not on the character level that you typically want when working with strings.
- utf8_general_ci vs. utf8_bin only makes a difference when collating, i.e. sorting and comparing strings. With utf8_bin, data is treated in binary form, i.e. only identical data is identical. With utf8_general_ci, some logic is applied, e.g. "é" sorts together with "e" and upper case is considered equal to lower case.
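As a rough illustration of the "check, then reject" approach with mb_check_encoding, here is a minimal sketch. It assumes the mbstring extension is loaded. The function name require_valid_utf8 and the choice of exception are hypothetical.

```php
<?php
// Hypothetical helper: rejects input that is not well-formed UTF-8
// instead of trying to guess or repair its encoding.
function require_valid_utf8(string $input): string
{
    if (!mb_check_encoding($input, 'UTF-8')) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $input;
}

$ok = require_valid_utf8("na\xC3\xAFve");  // "naïve", well-formed UTF-8

try {
    // 0xC3 starts a two-byte sequence, but 0x28 is not a continuation byte.
    require_valid_utf8("\xC3\x28");
    $rejected = false;
} catch (InvalidArgumentException $e) {
    $rejected = true;
}
```

Whether to throw, log, or return an error to the client depends on the application; the point is only that the bad bytes never reach the database.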
should it really be utf8_general_ci or rather utf8_bin?
You must use utf8_bin for case-sensitive search; otherwise use utf8_general_ci.
is mb_internal_encoding('UTF-8') necessary in PHP 5.3 and higher and if so does this mean I have to use all multi byte functions instead of its core functions like mb_substr() instead of substr()?
Yes, of course. If you have a multibyte string, you need the mb_* family of functions to work with it, except for binary-safe standard PHP functions like str_replace() (and a few others).
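To make the byte-versus-character difference concrete, here is a small sketch comparing core and mb_* functions on the same UTF-8 string. It assumes the mbstring extension is loaded, and the sample word is chosen only for illustration.

```php
<?php
$word = "\xC3\xA9t\xC3\xA9";                  // "été" in UTF-8: 3 characters, 5 bytes

$byteLength = strlen($word);                  // 5, counts bytes
$charLength = mb_strlen($word, 'UTF-8');      // 3, counts characters

$byteCut = substr($word, 0, 1);               // "\xC3", only half of the first "é"
$charCut = mb_substr($word, 0, 1, 'UTF-8');   // "é", the whole first character

// str_replace() compares raw bytes, so it is safe as long as the needle
// and the haystack use the same encoding.
$replaced = str_replace("\xC3\xA9", 'e', $word);  // "ete"
```

The result of substr() here is no longer valid UTF-8, which is exactly the kind of wrong result the mb_* functions avoid.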
is it still necessary to check for malformed input strings and if so what is a reliable function/class to do so? I would rather not strip bad data and don't know enough about transliteration.
Hmm, no, you can't check it.
Source: https://stackoverflow.com/questions/11013537/utf8-workflow-php-mysql-summarized