Data Encoding: A Guide to UTF-8 for PHP and MySQL

As a MySQL or PHP developer, once you step beyond the comfortable confines of English-only character sets, you quickly find yourself entangled in the wonderfully wacky world of UTF-8.
A Quick UTF-8 Primer

Unicode is a widely-used computing industry standard that defines a comprehensive mapping of unique numeric code values to the characters in most of today's written character sets to aid with system interoperability and data interchange.

UTF-8 is a variable-width encoding that can represent every character in the Unicode character set. It was designed for backward compatibility with ASCII and to avoid the complications of endianness and byte order marks in UTF-16 and UTF-32. UTF-8 has become the dominant character encoding for the World Wide Web, accounting for more than half of all Web pages.

UTF-8 encodes each character using one to four bytes. The first 128 characters of Unicode correspond one-to-one with ASCII, making valid ASCII text also valid UTF-8-encoded text. It is for this reason that systems that are limited to use of the English character set are insulated from the complexities that can otherwise arise with UTF-8.

For example, the Unicode hexadecimal code for the letter A is U+0041, which in UTF-8 is simply encoded with the single byte 41. In comparison, the Unicode hexadecimal code for the character 𣎴 is U+233B4, which in UTF-8 is encoded with the four bytes F0 A3 8E B4.
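The byte sequences above can be checked directly in PHP. This is a minimal sketch, assuming PHP 7.0 or later (for the "\u{...}" escape syntax); the variable names are purely illustrative.

```php
<?php
// Requires PHP 7.0+ for the "\u{...}" code point escape.
$letterA = "A";          // U+0041
$cjk     = "\u{233B4}";  // U+233B4

echo bin2hex($letterA), PHP_EOL; // 41
echo bin2hex($cjk), PHP_EOL;     // f0a38eb4
echo strlen($cjk), PHP_EOL;      // 4 (strlen counts bytes, not characters)
```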
On a previous job, we began running into data encoding issues when displaying bios of artists from all over the world. It soon became apparent that there were problems with the stored data, as sometimes the data was correctly encoded and sometimes it was not.
This led programmers to implement a hodge-podge of patches, sometimes with JavaScript, sometimes with HTML charset meta tags, sometimes with PHP, and so on. Soon, we ended up with a list of 600,000 artist bios with double- or triple-encoded information, with data being stored in different ways depending on who programmed the feature or implemented the patch. A classic technical rat's nest.
Indeed, navigating through UTF-8 related data encoding issues can be a frustrating and hair-pulling experience. This post provides a concise cookbook for addressing these issues when working with PHP and MySQL in particular, based on practical experience and lessons learned (and with thanks, in part, to information discovered here and here along the way).
Specifically, we'll cover the following in this post:
  • Mods you'll need to make to your php.ini file and PHP code.
  • Mods you'll need to make to your my.ini file and other MySQL-related issues to be aware of (including config mods needed if you're using Sphinx)
  • How to migrate data from a MySQL database previously encoded in latin1 to instead use a UTF-8 encoding

PHP & UTF-8 Encoding – modifications to your php.ini file:

The first thing you need to do is to modify your php.ini file to use UTF-8 as the default character set:
 default_charset = "UTF-8"
(Note: You can subsequently use phpinfo() to verify that this has been set properly.)
OK cool, so now PHP and UTF-8 should work just fine together. Right?
Well, not exactly. In fact, not even close.
While this change will ensure that PHP always outputs UTF-8 as the character encoding (in browser response Content-type headers), you still need to make a number of modifications to your PHP code to make sure that it properly processes and generates UTF-8 characters.
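As an alternative to reading through the phpinfo() output, the setting can also be checked (and, if needed, corrected) at runtime. This is only a sketch of one possible sanity check, for example in a bootstrap file; it is not a substitute for fixing php.ini.

```php
<?php
// Read the effective value of default_charset for this request.
$charset = (string) ini_get('default_charset');

// If php.ini was not updated, override the value for the current script.
if (strcasecmp($charset, 'UTF-8') !== 0) {
    ini_set('default_charset', 'UTF-8');
}

echo ini_get('default_charset'), PHP_EOL;
```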

PHP & UTF-8 Encoding – modifications to your code:

To be sure that your PHP code plays well in the UTF-8 data encoding sandbox, here are the things you need to do:
  • Set UTF-8 as the character set for all headers output by your PHP code
    In every PHP output header, specify UTF-8 as the encoding:
      header('Content-Type: text/html; charset=utf-8');  
  • Specify UTF-8 as the encoding type for XML
      <?xml version="1.0" encoding="UTF-8"?>  
  • Strip out unsupported characters from XML
    Since not all UTF-8 characters are accepted in an XML document, you'll need to strip any such characters out from any XML that you generate. A useful function for doing this (which I found here) is the following:
      function utf8_for_xml($string)
      {
          // Replace anything outside the XML 1.0 character ranges with a space.
          return preg_replace(
              '/[^\x{0009}\x{000a}\x{000d}\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]+/u',
              ' ',
              $string
          );
      }
    Here's how you can use this function in your code:
      $safeString = utf8_for_xml($yourUnsafeString);    
  • Specify UTF-8 as the character set for all HTML content
    For HTML content, specify UTF-8 as the encoding:
      <meta http-equiv="Content-Type" content="text/html; charset=utf-8">  
    In HTML forms, specify UTF-8 as the encoding:
      <form accept-charset="utf-8">  
  • Specify UTF-8 as the encoding in all calls to htmlspecialchars
    e.g.:
      htmlspecialchars($str, ENT_NOQUOTES, "UTF-8")  
    *Note: As of PHP 5.6.0, the default_charset value is used as the default. From PHP 5.4.0 to 5.5.x, UTF-8 was the default, and prior to PHP 5.4.0, ISO-8859-1 was the default. It's therefore a good idea to always explicitly specify UTF-8 to be safe, even though this argument is technically optional.
    Also note that, when your pages are served as UTF-8, htmlspecialchars is generally sufficient: htmlentities additionally converts every character that has a named HTML entity, which is unnecessary when the output encoding can already represent those characters.
  • Set UTF-8 as the default character set for all MySQL connections
    Specify UTF-8 as the default character set to use when exchanging data with the MySQL database using mysql_set_charset:
      $link = mysql_connect('localhost', 'user', 'password');
      mysql_set_charset('utf8', $link);
    Note that the original mysql extension (including mysql_set_charset) was deprecated in PHP 5.5.0 and removed in PHP 7.0.0, so mysqli::set_charset should be used instead:
      $mysqli = new mysqli("localhost", "my_user", "my_password", "test");

      /* check connection */
      if (mysqli_connect_errno()) {
          printf("Connect failed: %s\n", mysqli_connect_error());
          exit();
      }

      /* change character set to utf8 */
      if (!$mysqli->set_charset("utf8")) {
          printf("Error loading character set utf8: %s\n", $mysqli->error);
      } else {
          printf("Current character set: %s\n", $mysqli->character_set_name());
      }

      $mysqli->close();
  • Always use UTF-8 compatible versions of string manipulation functions
    There are several PHP functions that will fail, or at least not behave as expected, if the character representation needs more than 1 byte (as UTF-8 does). An example is the strlen function that will return the number of bytes rather than the number of characters.
    Two options are available for dealing with this:
    • The iconv functions that are available by default with PHP provide multibyte compatible versions of many of these functions (e.g., iconv_strlen, etc.). Remember, though, that the strings you provide to these functions must themselves be properly encoded.
    • There is also the mbstring extension to PHP (information on enabling and configuring it is available here). This extension provides a comprehensive set of functions that properly account for multibyte encoding.
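To make the byte-versus-character distinction concrete, here is a small sketch comparing strlen with its iconv and mbstring counterparts. It assumes PHP 7.0 or later and that the iconv extension is available; the mbstring calls are guarded because that extension is optional.

```php
<?php
// Requires PHP 7.0+ for "\u{...}"; iconv is assumed to be available.
$word = "na\u{00EF}ve"; // "naïve": 5 characters, 6 bytes in UTF-8

echo strlen($word), PHP_EOL;                // 6 (bytes)
echo iconv_strlen($word, 'UTF-8'), PHP_EOL; // 5 (characters)

// mbstring is an optional extension, so only use it when it is loaded.
if (extension_loaded('mbstring')) {
    echo mb_strlen($word, 'UTF-8'), PHP_EOL;       // 5
    echo mb_substr($word, 0, 3, 'UTF-8'), PHP_EOL; // naï
    echo mb_strtoupper($word, 'UTF-8'), PHP_EOL;   // NAÏVE
}
```

Passing the encoding explicitly to each call avoids depending on the mbstring or iconv defaults configured on a given server.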

MySQL & UTF-8 Encoding – modifications to your my.ini file:

On the MySQL/UTF-8 side of things, modifications to the my.ini file are required as follows:
  • Set the following config parameters after each corresponding tag:
      [client]
      default-character-set=utf8

      [mysql]
      default-character-set=utf8

      [mysqld]
      character-set-client-handshake = false  # force encoding to utf8
      character-set-server=utf8
      collation-server=utf8_general_ci

      [mysqld_safe]
      default-character-set=utf8
  • After making the above changes to your my.ini file, restart your MySQL daemon.
  • To verify that everything has properly been set to use the UTF-8 encoding, execute the following query:
      mysql> show variables like 'char%';  
    The output should look something like:
      | character_set_client     | utf8                       |
      | character_set_connection | utf8                       |
      | character_set_database   | utf8                       |
      | character_set_filesystem | binary                     |
      | character_set_results    | utf8                       |
      | character_set_server     | utf8                       |
      | character_set_system     | utf8                       |
      | character_sets_dir       | /usr/share/mysql/charsets/ |
    If you instead see latin1 listed for any of these, double-check your configuration and make sure you've properly restarted your MySQL daemon.
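If you would rather automate this check than eyeball the output, something along the following lines could work. This is a hedged sketch: non_utf8_charset_vars is a hypothetical helper, and the sample array stands in for the rows you would fetch from the query above (mapped as variable name => value).

```php
<?php
/**
 * Return the character-set variables that are not set to a UTF-8 charset.
 * $vars is assumed to map Variable_name => Value, as built from the rows
 * of SHOW VARIABLES LIKE 'char%'. Hypothetical helper for illustration.
 */
function non_utf8_charset_vars(array $vars): array
{
    // character_set_filesystem is expected to be "binary" and
    // character_sets_dir is a path, so both are skipped.
    $ignored = ['character_set_filesystem', 'character_sets_dir'];
    $accepted = ['utf8', 'utf8mb3', 'utf8mb4'];

    $bad = [];
    foreach ($vars as $name => $value) {
        if (in_array($name, $ignored, true)) {
            continue;
        }
        if (!in_array($value, $accepted, true)) {
            $bad[$name] = $value;
        }
    }
    return $bad;
}

// Sample data standing in for a real query result.
$sample = [
    'character_set_client'     => 'utf8',
    'character_set_connection' => 'latin1',
    'character_set_filesystem' => 'binary',
    'character_set_server'     => 'utf8',
    'character_sets_dir'       => '/usr/share/mysql/charsets/',
];

print_r(non_utf8_charset_vars($sample));
```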

MySQL & UTF-8 Encoding – other things to consider:

  • MySQL's utf8 character set is actually a partial implementation of UTF-8. Specifically, it uses a maximum of three bytes per character, whereas four bytes are required to encode the full Unicode range. This is fine for characters in the Basic Multilingual Plane, which covers most modern languages, but astral symbols (code points U+010000 to U+10FFFF, which include emoji and some rarer CJK characters) require a four-byte encoding that MySQL's utf8 does not support. MySQL 5.5.3 addressed this by adding the utf8mb4 character set, which uses a maximum of four bytes per character and thereby supports the full UTF-8 encoding. So if you're using MySQL 5.5.3 or later, use utf8mb4 instead of utf8 as your database/table/column character set. More info is available here.
  • If the connecting client has no way to specify the encoding for its communication with MySQL, after the connection is established you may have to run the following command/query:
      SET NAMES utf8;
  • When determining the size of varchar fields when modeling the database, don't forget that UTF-8 characters may require as many as 4 bytes per character.
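To find out whether existing text actually contains four-byte characters (and therefore needs utf8mb4 rather than MySQL's three-byte utf8), a check like the one below can be run from PHP. needs_utf8mb4 is a hypothetical helper name, and the sketch assumes PHP 7.0 or later.

```php
<?php
/**
 * True if $text contains a code point above U+FFFF, i.e. one that needs
 * four bytes in UTF-8. Invalid UTF-8 input makes preg_match fail, so the
 * function returns false in that case as well.
 */
function needs_utf8mb4(string $text): bool
{
    return preg_match('/[\x{10000}-\x{10FFFF}]/u', $text) === 1;
}

var_dump(needs_utf8mb4("caf\u{00E9}")); // bool(false): "é" needs two bytes
var_dump(needs_utf8mb4("\u{233B4}"));   // bool(true): needs four bytes
```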

MySQL & UTF-8 Encoding – if you use Sphinx:

  • In your Sphinx configuration file (i.e., sphinx.conf):
    • Set your index definition to have:
      charset_type = utf-8  
    • Add the following to your source definition:
      sql_query_pre = SET CHARACTER_SET_RESULTS=utf8
      sql_query_pre = SET NAMES utf8
  • Restart the engine and rebuild all indexes.
  • If you want to configure sphinx so that letters like C c Ć ć Ĉ ĉ Ċ ċ Č č are all treated as equivalent for search purposes, you will need to configure a charset_table (a.k.a. character folding) which is essentially an equivalency mapping between characters. More information is available here.

MySQL – Migrating database data that is already encoded in latin1 to UTF-8

If you have an existing MySQL database that is already encoded in latin1, here's how to convert it to UTF-8:
  1. Make sure you've made all the modifications to the configuration settings in your my.ini file, as described above.
  2. Execute the following command:
     ALTER SCHEMA `your-db-name` DEFAULT CHARACTER SET utf8;
  3. From the command line, verify that everything is properly set to UTF-8:
     mysql> show variables like 'char%';  
  4. Create a dump file with latin1 encoding for the table you want to convert:
     mysqldump -u USERNAME -pDB_PASSWORD --opt --skip-set-charset \
         --default-character-set=latin1 --skip-extended-insert \
         DATABASENAME --tables TABLENAME > DUMP_FILE_TABLE.sql
    e.g.:
     mysqldump -u root --opt --skip-set-charset \
         --default-character-set=latin1 --skip-extended-insert \
         artists-database --tables tbl_artist > tbl_artist.sql
  5. Do a global search and replace of the charset in the dump file from latin1 to utf8:
    e.g., using Perl:
     perl -i -pe 's/DEFAULT CHARSET=latin1/DEFAULT CHARSET=utf8/' DUMP_FILE_TABLE.sql
    Note to Windows users: This charset string replacement (from latin1 to utf8) can also be done using find-and-replace in WordPad (or some other text editor, such as vim). Be sure to save the file just as it is, though (don't save it as a Unicode text file!).
  6. From this point, we will start modifying the database data, so it would be prudent to back up the database if you haven't already done so. Then, restore the dump into the database:
     mysql> source "DUMP_FILE_TABLE.sql";  
  7. Search for any records that may not have converted properly and correct them. Since non-ASCII characters are multi-byte by design, we can find them by comparing the byte length to the character length (i.e., to identify rows that may hold double-encoded UTF-8 characters that need to be fixed).
    • See if there are any records with multi-byte characters (if this query returns zero, then there don't appear to be any records with multi-byte characters in your table and you can proceed to Step 8).
        mysql> select count(*) from MY_TABLE where LENGTH(MY_FIELD) != CHAR_LENGTH(MY_FIELD);  
    • Copy rows with multi-byte characters into a temporary table:
        create table temptable (
            select * from MY_TABLE
            where LENGTH(MY_FIELD) != CHAR_LENGTH(MY_FIELD)
        );
    • Convert double-encoded UTF-8 characters to proper UTF-8 characters
      This is actually a bit tricky. A double encoded string is one that was properly encoded as UTF-8. However, MySQL then did us the erroneous favor of converting it (from what it thought was latin1) to UTF-8 again, when we set the column to UTF-8 encoding. Resolving this therefore requires a two step process through which we "trick" MySQL in order to preclude it from doing us this "favor".
      First, we set the encoding type for the column back to latin1, thereby removing the double encoding:
      e.g.:
        alter table temptable modify temptable.ArtistName varchar(128) character set latin1;  
      Note: Be sure to use the correct field type for your table. In the example above, for our table, the correct field type for 'ArtistName' was varchar(128), but the field in your table could be text or any other type. Be sure to specify it properly!
      The problem is that now, if we set the column encoding back to UTF-8, MySQL will run the latin1 to UTF-8 data encoding for us again and we'll be back to where we started. To avoid this, we change the column type to blob and THEN we set it to UTF-8. This exploits the fact that MySQL will not attempt to encode a blob. We are thereby able to "fool" the MySQL charset conversion to avoid the double encoding issue.
      e.g.:
        alter table temptable modify temptable.ArtistName blob;
        alter table temptable modify temptable.ArtistName varchar(128) character set utf8;
      (Again, as noted above, be sure to use the proper field type for your table.)
    • Remove rows with only single-byte characters from the temporary table:
        delete from temptable where LENGTH(MY_FIELD) = CHAR_LENGTH(MY_FIELD);
    • Re-insert fixed rows back into the original table (before doing this, you may want to run some selects on the temptable to verify that it appears to be properly corrected, just as a sanity check).
        replace into MY_TABLE (select * from temptable);  
  8. Verify the remaining data and, if necessary, repeat the process in step 7 (this could be necessary, for example, if the data was triple encoded). Further errors, if any, may be easiest to resolve manually.
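To see what double encoding does at the byte level, and why passing the data back through latin1 undoes it, the effect can be reproduced in PHP. This is an illustrative sketch only: it assumes PHP 7.0 or later with the iconv extension, and it uses ISO-8859-1 as a stand-in for MySQL's latin1 (which actually corresponds to Windows-1252, so bytes in the 0x80 to 0x9F range behave differently).

```php
<?php
// Requires PHP 7.0+ ("\u{...}") and the iconv extension.
$original = "\u{00E9}"; // "é", UTF-8 bytes c3 a9

// Double encoding: the UTF-8 bytes are misread as latin1 and encoded again.
$double = iconv('ISO-8859-1', 'UTF-8', $original);

// Repair: converting back to latin1 restores the original UTF-8 bytes.
$repaired = iconv('UTF-8', 'ISO-8859-1', $double);

echo bin2hex($original), PHP_EOL; // c3a9
echo bin2hex($double), PHP_EOL;   // c383c2a9
echo bin2hex($repaired), PHP_EOL; // c3a9

// Byte length versus character length, the same comparison the
// LENGTH() != CHAR_LENGTH() query performs.
echo strlen($double), ' ', iconv_strlen($double, 'UTF-8'), PHP_EOL; // 4 2
```

The repaired bytes are then reinterpreted as UTF-8 without any further conversion, which is the role the blob step plays in the SQL procedure.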

Source code and resource files

One other thing to remember and verify is that your source code files, resources files, and so on, are all being saved properly with UTF-8 data encoding. Otherwise, any "special" characters in these files may not be handled correctly.
In NetBeans, for example, you can right-click on your project, choose Properties, and then under "Sources" you will find the data encoding option (it usually defaults to UTF-8, but it's worth checking).
Or in Windows Notepad, use the "Save As…" option in the File menu, and select the UTF-8 encoding option at the bottom of the dialog. (Note that the "Unicode" option that Notepad provides is actually UTF-16, so that's not what you want.)
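If you want to check files programmatically rather than editor by editor, one possible approach is to test whether their contents are well-formed UTF-8. The helper below is a hypothetical sketch; in practice you would pass it the result of file_get_contents for each source or resource file.

```php
<?php
/**
 * True if $bytes is well-formed UTF-8. With the /u modifier, preg_match
 * returns false on malformed UTF-8 instead of 1, which is what this
 * (hypothetical) helper relies on.
 */
function is_valid_utf8(string $bytes): bool
{
    return preg_match('//u', $bytes) === 1;
}

var_dump(is_valid_utf8("caf\xC3\xA9")); // bool(true): "é" encoded as UTF-8
var_dump(is_valid_utf8("caf\xE9"));     // bool(false): "é" encoded as latin1
```

Note that pure ASCII content passes this check regardless of how the file was saved, since ASCII is valid UTF-8.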

Wrap-up

Although it can be somewhat tedious, taking the time to go through these steps to systematically address your MySQL and PHP UTF-8 data encoding issues can ultimately save you a great deal of time and grief. In the long run, this type of methodical approach is far superior to the all-too-common tendency to just keep patching the system.
This guide hopefully emphasizes the importance of taking the charset definition into consideration when setting up a project environment in the first place and working in a software project environment that properly accounts for character encoding in its manipulation of text and strings.
This article originally appeared on Toptal.

