MySQL 4.1 and charsets
When using MySQL 4.1 and above every now and then problems occur when displaying web sites with different charsets. This is due to the most important innovation from MySQL 4.1 on: being Unicode compatible (as opposed to the previous versions). Thereby it is far more easier to provide multilingual web sites.
The right charset in HTML documents
A widely spread and often sufficient charset (e.g. for West European content) is ISO-8859-1 (also known as "Latin-1"). It derives from the ISO-8859 familiy and contains among others the German umlauts and some relevant special characters. Is the document delivered with this charset all contained characters that play a special role in HTML (e.g. <, > and &) can be used directly in the HTML source code.Since ISO-8859-1 uses only one byte to encode a character only 256 characters can be displayed. But what is to do if you want to display characters on your web site that are not included in the selected charset? One way is to paste these characters by means of a special numeric notation. With this notation every character from the Unicode character set is displayable. The large greek Sigma (?) for instance isn't included in ISO-8859-1, so it can't be used in the source code and the browser would use another character instead. To display it however said numeric notation could be used instead (in this case »Σ«). This practice is however, if any, only useful for certain characters.
UTF-8 for multilingual web sites
In case whole sentences of another language are to be displayed a charset with a bigger character set is quite useful. UTF-8 for instance doesn't encode characters in one byte but a variable number of bytes - thereby all characters of the Unicode system are displayable. So they can be noted directly in the HTML source code as far as the used editor supports UTF-8.
The information which charset is required to decode the received resource is retrieved from the web server by the requesting browser. To this the web server has to know that this information has to be sent also which can happen in various ways:
- in PHP the function header() is provided. By means of »header('Content-Type: text/html; charset=utf-8');« the web server is told the used MIME-Typ and charset.
- in a .htaccess file you can tell the web server to deliver certain resources with a defined charset, using AddCharset. So the line »AddCharset utf-8 .html .php« causes the web server to deliver all documents with the file endings html and php as resources with the UTF-8 charset. If you want to specify all the responses that are delivered with the MIME types text/plain and text/html with UTF-8 just add a line that contains the instruction AddDefaultCharset: »AddDefaultCharset utf-8«
In order that after saving the document still the right charset is used there also exists the meta tag »http-equiv="content-type"« (to be noted in the header section). This should be interpreted by the web server but only if no HTTP header is specified - it would make no sense to specify a different statement. In cases of conflict the statement in the HTTP header should be preferred by the browser which could lead to the oddness that the same document (delivered by the web server) would be - as opposed to a local call up - decoded differently. Since called up locally no HTTP header is available for the browser and it would comply with the statement in the meta tag. The information whether a (and which) charset statement is included in the HTTP header can be obtained in different ways: many browsers are showing the encoding statements delivered by the server in their page information. Moreover there are so called HTTP trace services like the Web-Sniffer which show the header information.
The difference between MySQL =<4.0 and =>4.1
MySQL 4.0 and previous versions only supported charsets whose characters were encoded with 8 Bit each. The used charset was determined on server level and complied with Latin-1 by default. Neither information about the used charset and sorting directives of the data were saved, nor did MySQL handle a conversion of data into other charsets. However, being able to work with UTF-8 the UTF-8 byte sequences were just saved, e.g. „Müller” instead of „Müller” since the ü is decoded with two bytes (188 and 195) which in turn are decoded as à and ¼. As a result some drawbacks appeared: the data was not sorted correctly anymore, the full-text search wasn't completely usable anymore and much more. Since MySQL 4.1 there is far more support for different charsets. The data are internally encoded with UTF-8 by default, while the data transferred between server and client is still using Latin-1. For a single-language web site that uses ISO-8859-1 nothing changes at first. If the web site has to be in UTF-8 you have to tell the database server that the data have to be received and sent in UTF-8 encoding. This is achieved with the SQL instruction »SET NAMES 'utf8'« or »SET CHARACTER SET utf8«. The same statements have to be made when importing / exporting data (e.g. via phpMyAdmin). For DB-Dumps created by a MySQL 4.0 server the charset of the data is selected under "charset encoding of the file".
"Müller" instead of "Müller"
Nonetheless it can happen that MySQL 4.1-based web applications spoil umlauts. This is usually caused by data that has not been transferred or saved correctly. If we have an HTML form for instance, the browser has to be told which charset it is allowed to accept for this form. If the web site is to be delivered in ISO-8859-1 encoding but the browser is sending UTF-8 by mistake, the abovementioned deviation occurs: the browser is sending the byte sequence that matches „Müller” in ISO-8859-1. In case this byte sequence gets into the database without validation the error continues to occur. There are various ways to prevent this:
- the browser must only accept ISO-8859-1: in the introductory form element the attribute accept-charset with the value ISO-8859-1 is used, e.g.:
<form action="save.php" method="post" accept-charset="ISO-8859-1">
This method has the drawback that some browsers are sending question marks for characters entered outside of ISO-8859-1. The saving procedure in a processing PHP script could look like this:<?php
$link = mysql_connect('host', 'user', 'pass')
or die ('Connecting to the DB not possible');
mysql_query('SET NAMES \'latin1\'', $link);
mysql_query('INSERT INTO database.table (column) ' .
'VALUES (\'' . mysql_real_escape_string($data, $link) . '\')', $link)
or die('INSERT failed: ' . mysql_error($link));
?> - you generally accept and save UTF-8 by using »accept-charset="UTF-8"« in the form element and the instruction »SET NAMES 'utf8'« in MySQL. But here it's also likely that characters get lost during the delivery with Latin-1 since Latin-1 has a smaller character set than UTF-8.
The sorting (collation) of the data
The default setting for the sorting method on our servers corresponds to „latin1_swedish_ci”. This provides the sorting method mostly expected by the customers. That doesn't apply to the use of UTF-8 since there are special sorting methods. When using the default „utf8_general_ci” sorting apparently happens the "wrong" way: umlauts and the ß are moved backwards since the sorting happens by the byte values and the mentioned characters are arranged after the z. Unfortunately MySQL doesn't provide a German sorting directive for UTF-8 data by default. However it is possible to fix this with a trick - by converting the data for the sorting process in the desired charset with the desired sorting directive:
SELECT utf8_column FROM table
ORDER BY CONVERT(utf8_column USING latin1) COLLATE latin1_german2_ci ASC
For sorting the UTF-8 values are converted into the charset Latin-1 with the sorting directive latin1_german2_ci, thus taking care that ä, ö, ü and ß are treated like ae, oe, ue and ss during the sorting.
Examples
- A page with exlusive chinese content (e.g. GB2312) could be constructed as follows:
<?php header('Content-Type: text/html; charset=GB2312'); ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="zh" lang="zh">
<head>
<title>GB2312-Example</title>
<meta http-equiv="Content-Type" content="text/html; charset=GB2312" />
<meta http-equiv="Content-Language" content="zh" />
<meta name="DC.language" scheme="DCTERMS.RFC3066" content="zh" />
<!-- further statements in the header... -->
</head>
<body>
<!-- any content, chinese characters can be noted directly in the source code -->
<?php
/**
* now some dynamic content
*/
$link = mysql_connect('host', 'user', 'pass')
or die('No Connection');
// we want to speak only gb2312 with the MySQL-Server (from 4.1 on)
mysql_query('SET NAMES \'gb2312\'', $link);
$result = mysql_query('SELECT gb2312_column FROM database.table...', $link);
// display results
?>
</body>
</html>