UTF-8 is the de facto standard for web applications now, but it was not PHP's default encoding for a long time (default_charset only defaults to UTF-8 as of PHP 5.6; PHP 6.0 was never released). Most servers are set up for ISO-8859-1 by default.
Some useful options to have in .htaccess:
########################################
# Locale settings
########################################
# See: http://php.net/manual/en/timezones.php
php_value date.timezone "Europe/Amsterdam"
SetEnv LC_ALL nl_NL.UTF-8
########################################
# Set up UTF-8 encoding
########################################
AddDefaultCharset UTF-8
AddCharset UTF-8 .php
php_value default_charset "UTF-8"
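# Note: the iconv.* settings and mbstring.internal_encoding / mbstring.http_output
# are deprecated since PHP 5.6, where default_charset is used instead.
php_value iconv.input_encoding "UTF-8"
php_value iconv.internal_encoding "UTF-8"
php_value iconv.output_encoding "UTF-8"
php_value mbstring.internal_encoding UTF-8
php_value mbstring.http_output UTF-8
# Boolean settings are set with php_flag, not php_value
php_flag mbstring.encoding_translation On
# Note: mbstring.func_overload is deprecated since PHP 7.2 and removed in PHP 8.0
php_value mbstring.func_overload 6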
# See also php functions:
# mysql_set_charset
# mysql_client_encoding
# database settings
#CREATE DATABASE db_name
# CHARACTER SET utf8
# DEFAULT CHARACTER SET utf8
# COLLATE utf8_general_ci
# DEFAULT COLLATE utf8_general_ci
# ;
#
#ALTER DATABASE db_name
# CHARACTER SET utf8
# DEFAULT CHARACTER SET utf8
# COLLATE utf8_general_ci
# DEFAULT COLLATE utf8_general_ci
# ;
#ALTER TABLE tbl_name
# DEFAULT CHARACTER SET utf8
# COLLATE utf8_general_ci
# ;
The web server may be configured to send inappropriate headers, so it's recommended to override them at the application level. For instance:
header('Content-Type: text/html; charset=utf-8');
Add HTML meta content-type:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
Use htmlspecialchars() instead of htmlentities(), because the former is enough in UTF-8 and the latter is incompatible with UTF-8 by default.
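A minimal sketch of that advice, assuming the page is served as UTF-8. The wrapper name e() is hypothetical; passing the charset explicitly avoids relying on whatever default the installed PHP version uses.

```php
<?php
// Hypothetical helper; the name e() is not from the original answer.
function e(string $text): string
{
    // ENT_QUOTES escapes both double and single quotes. The explicit
    // 'UTF-8' argument avoids depending on the version-specific default.
    return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}

// Markup characters are escaped; the UTF-8 bytes of "ž" pass through untouched.
echo e("<b>\xC5\xBE</b>"), "\n";
```

Only the characters that are special in HTML are rewritten, which is why nothing more is needed when the document itself is UTF-8.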
For regular expressions use the u modifier. For example:
preg_match('/ž{3,5}/u', $string, $matches);
Combined with an empty pattern, the u modifier is also the most reliable way to check whether a given string is valid UTF-8:
if (@preg_match('//u', $string) === false) {
// NOT valid!
} else {
// Valid!
}
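The two uses of the u modifier above can be sketched together as follows. The helper name is_valid_utf8() is hypothetical, and the sample strings are written as explicit byte escapes so the example does not depend on how the source file is saved.

```php
<?php
// Hypothetical helper name; it wraps the empty-pattern check shown above.
function is_valid_utf8(string $value): bool
{
    // With /u, PCRE first validates the subject as UTF-8. An invalid
    // subject makes preg_match() return false; a valid one matches the
    // empty pattern and returns 1.
    return @preg_match('//u', $value) === 1;
}

$z = "\xC5\xBE";               // "ž" encoded as two UTF-8 bytes
$subject = str_repeat($z, 3);  // "žžž"

// Without /u the quantifier applies to the last byte only, so no match.
var_dump(preg_match('/^' . $z . '{3}$/', $subject));
// With /u the quantifier applies to the whole character.
var_dump(preg_match('/^' . $z . '{3}$/u', $subject));

var_dump(is_valid_utf8($subject));   // well-formed input
var_dump(is_valid_utf8("\xC5"));     // truncated two-byte sequence
```

Comparing against 1 rather than testing for "not false" keeps the helper's return value a plain boolean.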
If you use a database, always set the appropriate connection encoding right after the connection is made. Example for MySQL:
mysql_set_charset('utf8', $link);
Also check whether the columns in the database are in UTF-8. It's not always needed, but recommended.
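The mysql_* functions were removed in PHP 7, so on current versions the same idea could look like the sketch below, which assumes PDO with the MySQL driver. The helper name, host and database name are hypothetical. It also swaps in utf8mb4, because MySQL's utf8 charset stores at most three bytes per character.

```php
<?php
// Hypothetical helper: builds a PDO MySQL DSN that fixes the connection charset.
function utf8_mysql_dsn(string $host, string $dbname): string
{
    // The charset part of the DSN sets the connection encoding at connect
    // time, so no separate "SET NAMES" query is needed afterwards.
    return sprintf('mysql:host=%s;dbname=%s;charset=utf8mb4', $host, $dbname);
}

echo utf8_mysql_dsn('localhost', 'app'), "\n";

// Usage sketch, needs a reachable server and real credentials:
// $pdo = new PDO(utf8_mysql_dsn('localhost', 'app'), $user, $password);
```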
You're right, UTF-8 is a good choice for web applications.
Encoding is meta-information about the data that gets processed. As long as you know the encoding of the (binary) data, you know what you're dealing with. You start to get lost if you don't know the encoding. I often call this a chain: if the encoding chain is broken, the data will be broken. This is true both for displaying data and for security.
As a rule of thumb, PHP is binary; it's the context (or you) that specifies the encoding (e.g. how you save your PHP source-code files).
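To illustrate what "PHP is binary" means in practice, here is a small sketch. The sample word is my own choice and is written as explicit byte escapes; counting characters via preg_split() is just one way to do it that needs no extra extension.

```php
<?php
$word = "\xC5\xBEivot";   // "život", written as explicit UTF-8 bytes

// strlen() counts bytes, because PHP strings are byte sequences.
$bytes = strlen($word);

// Splitting on the empty pattern with /u yields one element per code point,
// because here we tell PCRE that the bytes are to be read as UTF-8.
$chars = count(preg_split('//u', $word, -1, PREG_SPLIT_NO_EMPTY));

echo "bytes: $bytes, characters: $chars\n";
```

The string itself carries no encoding information; the second count is only correct because the code states the encoding explicitly.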
So let's tackle a short (and incomplete) list:
Environment variables might tell you about the locale in use and the encoding. File systems have their own encoding for the names of files and directories, for example. I'm not very well versed in this subject; normally we try to name our files in English so as to use only characters in the range of US-ASCII, which is safe for the Latin extended charsets like ISO-8859-1 in your case as well as for UTF-8.
Just keep this in mind when you save files your users upload: filter filenames to basic letters and punctuation (a-z, A-Z, 0-9, ., -, _) and you'll have nearly no hassles. You can even make them all lowercase for visual purposes.
If you feel that this degrades usability and the file system does not offer the Unicode range of characters as of UTF-8, you can fall back to simple encodings like rawurlencode (percent-encoding, triplets) and offer files for download by resolving that name on disk.
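Both options for uploaded filenames could be sketched roughly like this. The function names, the underscore replacement and the 'file' fallback are my own assumptions, not part of the original advice.

```php
<?php
// Hypothetical helper: reduce a name to a-z, 0-9, ".", "-" and "_".
function safe_upload_name(string $name): string
{
    $name = strtolower($name);
    // Every run of other bytes (spaces, slashes, non-ASCII) becomes one "_".
    $name = preg_replace('/[^a-z0-9._-]+/', '_', $name);
    // Drop leading dots so the result is neither hidden nor "..".
    $name = ltrim($name, '.');
    return $name === '' ? 'file' : $name;
}

// Hypothetical helper: keep the original name recoverable by percent-encoding it.
function encoded_upload_name(string $name): string
{
    return rawurlencode($name);
}

$original = "My R\xC3\xA9sum\xC3\xA9.PDF";   // "My Résumé.PDF" as UTF-8 bytes
echo safe_upload_name($original), "\n";
echo encoded_upload_name($original), "\n";
```

The first variant loses information but is trivially safe; the second is reversible with rawurldecode(), so the original name can still be offered for download.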
Normally you just need to deal with what you have. Start asking a common sysadmin or programmer about character encoding and most will tell you that they are not really interested. Naturally that's subjective, but if you need someone to configure something for you, this can make a difference.
This is largely independent of PHP; it's about the output your scripts provide, so it depends on the field you work in.
Rule of thumb: specify it. If you didn't specify it (HTML files, CSS files, JavaScript files), don't expect it to work precisely. Just do it then. Encoding is a chain; if there are many components, ensure that each knows about its encoding. Otherwise browsers can only guess. UTF-8 is a good choice, but our job is to take care and make this precise and well defined.
As a general rule of thumb, start reading the php.ini file that ships with the PHP package of your Linux distro. It comes with readable documentation in its comments and further links. Some settings that come to mind:
default_charset - PHP always outputs a character encoding by default in the Content-type: header. To disable sending of the charset, simply set it to be empty. For general information see "Setting the HTTP charset parameter" (W3C). If you want to improve your site's output, e.g. to preserve the encoding information when users save the output with their browser, add the HTML http-equiv meta tag as well: <meta http-equiv="Content-type" content="text/html;charset=UTF-8">.
output_handler - This setting is worth looking at, as it specifies the output handler (see Output Buffering Control in the PHP docs), and each handler (mb, iconv) can have its own encoding settings (see Strings).
[...] $binary = (binary) $string; or $binary = b"binary string";.
[...] ISO-8859-1 but you're looking for UTF-8. Other functions like html_entity_decode use UTF-8 by default. Some, like htmlspecialchars_decode, do not specify a charset at all, so you need to read the PHP source code for a concrete understanding of how the function deals with the (binary) string.
To answer your question: the settings and parameters you need always depend on the components you use. For the general ones like the browser or the web server, it's possible to recommend settings that configure them for UTF-8. But with everything else it depends. The most important thing is to look for it and to ensure that you know the encoding and can configure/specify it. Often it's documented. As long as you don't need to deal with portable code, this is much simpler, as you have control of the environment or only need to deal with one specific environment. Write code defensively with encoding in mind and you should be fine.
Basically I do three things to work correctly with the Czech language:
1) define locale in PHP:
setlocale(LC_COLLATE, "cs_CZ");
setlocale(LC_CTYPE, "cs_CZ");
so you would use something like:
setlocale(LC_ALL, "en_US.utf8");
setlocale(LC_ALL, "nl_NL.utf8");
depending on the language currently selected.
2) define charset for the database:
mysql_query("set names latin2 collate latin2_czech_cs");
3) define the charset of PHP/HTML code:
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-2">
I don't use any .htaccess settings. You can adapt this to your case: for the locale use something like en_US.utf8 (depending on the language currently selected), and for the charset use utf-8 instead of latin2/iso-8859-2, and it should work well.
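Picking the locale per language could look like the sketch below. The language codes, the mapping and the function name are hypothetical, and which locale names are actually installed differs per system (check `locale -a`), so several spellings are tried.

```php
<?php
// Hypothetical mapping from an application language code to locale names.
function locale_candidates(string $lang): array
{
    $map = [
        'en' => ['en_US.UTF-8', 'en_US.utf8'],
        'nl' => ['nl_NL.UTF-8', 'nl_NL.utf8'],
        'cs' => ['cs_CZ.UTF-8', 'cs_CZ.utf8'],
    ];
    // Unknown language codes fall back to English (an arbitrary choice).
    return $map[$lang] ?? $map['en'];
}

// setlocale() tries each name in order and returns the one it applied,
// or false when none of them is available on this system.
$applied = setlocale(LC_ALL, locale_candidates('nl'));
echo $applied === false ? "no matching locale installed\n" : "using $applied\n";
```

Checking the return value matters, because a locale that is not installed is otherwise ignored silently.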
Try one of the following:
AddDefaultCharset UTF-8
AddCharset UTF-8 .php