There are no notfications.

Comparison of Unicode collations (MySQL)

Comparison of Unicode collations (MySQL)
Author: Stefán Örvar Sigmundsson
Initial publication:
Last updated:
Written in: English (United Kingdom)

Unicode is a technical standard that has become the predominant character set along with its UTF-8 character encoding. The purpose of this article is to explain the difference between the Unicode collations offered by MySQL, the open-source relational database management system (RDBMS).

Terminology

The following terminology must be clearly understood in order to select the appropriate Unicode collation:

Character set
A character set is a collection of characters that a system should be able to understand and display. Many character sets are small and thus only serve a limited number of languages and others are large and serve all the languages of the world. A character set can have one or more corresponding character encodings.
Character encoding
A character encoding specifies how the characters in a character set are represented in memory. Character encodings can be used to save memory by limiting the number of characters in a character set that can be represented. Many systems do not need to be able to handle all of the languages of the world and so character encodings that represent a limited range of characters in a character set are often useful.
Collation
A collation specifies the alphabetical and numerical order within a character encoding. Collations can be generic or specific to particular languages.

Character encodings

The following is a comparison of the Unicode character encodings as named and implemented by MySQL:

Encoding Size Byte order Notes
ucs2 2–4 bytes big-endian Made obsolete by utf16 which supports supplementary characters.
utf8 1–3 bytes big-endian Optimum for text primarily of Western languages. Supports fewer characters than utf8mb4.
utf8mb4 1–4 bytes big-endian Optimum for text primarily of Western languages. Has become the standard of the Internet.
utf16 2–4 bytes big-endian Optimum for text that mixes Western and Eastern/ancient languages.
utf16le 2–4 bytes little-endian Optimum when utf16 is required and the byte order must be little-endian.
utf32 4 bytes big-endian Optimum for text primarily of Eastern/ancient languages.

Collations

Collations can be set for the following data units: databases, tables and columns. The default collation of a database can be overridden per table and the default collation of a table can be overridden per column.

A language-specific collation should be chosen for a data unit when it can be assumed that only text of the corresponding language will be stored in the data unit. A language-generic collation should be chosen for a data unit when no language-specific collation exists for the language or when it can be assumed that multiple languages will be stored in the data unit.

Language-specific

MySQL offers collations for Croatian, Czech, Danish, Esperanto, Estonian, German, Hungarian, Icelandic, Latin, Latvian, Lithuanian, Persian, Polish, Romanian, Sinhalese, Slovak, Slovenian, Spanish, Swedish, Turkish and Vietnamese. The following collations need further explanation:

*_german2_ci
This collation is intended for German telephone directory sorting. There is no *_german.ci collation and a language-generic collation may be more suitable for German.
*_roman_ci
This collation is intended for Latin. It was named so to avoid confusion with the ISO/IEC 8859-1 character set which is often referred to as latin1.
*_spanish_ci
This collation is intended for modern Spanish.
*_spanish2_ci
This collation is intended for traditional Spanish.

Language-generic

The best choice for collation is usually *_unicode_520_ci when compatibility with older systems is not of concern.

*_bin
This collation is the most language-generic as it sorts text by its binary values and ignores all general language rules. It is case-sensitive unlike all the other collations.
*_general.ci
This collation is less accurate but faster than *_unicode_ci. It does not differentiate between characters such as a and á.
*_general_mysql500_ci
This collation is less accurate but faster than *_unicode_ci. It does not differentiate between characters such as a and á. It is backward compatible with *_general.ci of MySQL 5.1.24 and older.
*_unicode_ci
This collation is more accurate but slower than *_general_ci. It does differentiate between characters such as a and á. It is based on Unicode Collation Algorithm 4.0.
*_unicode_520_ci
This collation is more accurate but slower than *_general_ci. It does differentiate between characters such as a and á. It is based on Unicode Collation Algorithm 5.2.

External links