Comparison of Unicode collations (MySQL)
|Author:||Stefán Örvar Sigmundsson|
|Written in:||English (United Kingdom)|
Unicode is a technical standard that has become the predominant character set along with its UTF-8 character encoding. The purpose of this article is to explain the difference between the Unicode collations offered by MySQL, the open-source relational database management system (RDBMS).
The following terminology must be clearly understood in order to select the appropriate Unicode collation:
- Character set
- A character set is a collection of characters that a system should be able to understand and display. Many character sets are small and thus only serve a limited number of languages and others are large and serve all the languages of the world. A character set can have one or more corresponding character encodings.
- Character encoding
- A character encoding specifies how the characters in a character set are represented in memory. Character encodings can be used to save memory by limiting the number of characters in a character set that can be represented. Many systems do not need to be able to handle all of the languages of the world and so character encodings that represent a limited range of characters in a character set are often useful.
- A collation specifies the alphabetical and numerical order within a character encoding. Collations can be generic or specific to particular languages.
The following is a comparison of the Unicode character encodings as named and implemented by MySQL:
|ucs2||2–4 bytes||big-endian||Made obsolete by utf16 which supports supplementary characters.|
|utf8||1–3 bytes||big-endian||Optimum for text primarily of Western languages. Supports fewer characters than utf8mb4.|
|utf8mb4||1–4 bytes||big-endian||Optimum for text primarily of Western languages. Has become the standard of the Internet.|
|utf16||2–4 bytes||big-endian||Optimum for text that mixes Western and Eastern/ancient languages.|
|utf16le||2–4 bytes||little-endian||Optimum when utf16 is required and the byte order must be little-endian.|
|utf32||4 bytes||big-endian||Optimum for text primarily of Eastern/ancient languages.|
Collations can be set for the following data units: databases, tables and columns. The default collation of a database can be overridden per table and the default collation of a table can be overridden per column.
A language-specific collation should be chosen for a data unit when it can be assumed that only text of the corresponding language will be stored in the data unit. A language-generic collation should be chosen for a data unit when no language-specific collation exists for the language or when it can be assumed that multiple languages will be stored in the data unit.
MySQL offers collations for Croatian, Czech, Danish, Esperanto, Estonian, German, Hungarian, Icelandic, Latin, Latvian, Lithuanian, Persian, Polish, Romanian, Sinhalese, Slovak, Slovenian, Spanish, Swedish, Turkish and Vietnamese. The following collations need further explanation:
- This collation is intended for German telephone directory sorting. There is no *_german.ci collation and a language-generic collation may be more suitable for German.
- This collation is intended for Latin. It was named so to avoid confusion with the ISO/IEC 8859-1 character set which is often referred to as latin1.
- This collation is intended for modern Spanish.
- This collation is intended for traditional Spanish.
The best choice for collation is usually *_unicode_520_ci when compatibility with older systems is not of concern.
- This collation is the most language-generic as it sorts text by its binary values and ignores all general language rules. It is case-sensitive unlike all the other collations.
- This collation is less accurate but faster than *_unicode_ci. It does not differentiate between characters such as a and á.
- This collation is less accurate but faster than *_unicode_ci. It does not differentiate between characters such as a and á. It is backward compatible with *_general.ci of MySQL 5.1.24 and older.
- This collation is more accurate but slower than *_general_ci. It does differentiate between characters such as a and á. It is based on Unicode Collation Algorithm 4.0.
- This collation is more accurate but slower than *_general_ci. It does differentiate between characters such as a and á. It is based on Unicode Collation Algorithm 5.2.