The last couple of days I’ve been developing a little app that imports Twitter feeds and saves tweets to a MySQL database; however, I quickly hit an issue where certain tweets would not save. MySQL was returning the error message ‘SQLSTATE[HY000]: General error: 1366 Incorrect string value’. Looking at the specific tweets causing the database to throw the error I noticed that they contained emoticons.

The Problem

I was using utf8 encoding in MySQL which I thought (wrongly) could represent all unicode characters.

The unicodes for emoticons are fully supported by the UTF-8 encoding; however, MySQL’s utf8 does not! To save emoticons to a MySQL database we need to use utf8mb4.

The difference between MySQL’s utf8 and utf8mb4 is that the former can only store 3 byte characters whereas the latter can store 4 byte ones. Therefore with utf8 we can only store unicode characters from the Basic Multilingual Plane. Put more simply, utf8 is suitable for characters from the majority of modern languages and some symbols. Emoticon characters exist in the Supplementary Multilingual Plane for which we need to use utf8mb4.

The Fix

In order to save emoticons and other unicode not supported by MySQL’s utf8 we need to change the character set and collation properties of our database, tables and columns. As long as our database is currently using utf8 there should be no risk of data loss as utf8mb4 is fully backwards compatible with it. However, it is still a good idea to make a backup of the database before running the following commands.

Let’s first set the connection encoding to utf8mb4 and then change the database’s character set and collation to it:-

SET NAMES utf8mb4;
ALTER DATABASE database_name CHARACTER SET = utf8mb4 COLLATE = utf8mb4_unicode_ci;

(Replacing database_name with the name of the database.)

Next we need to convert the relevant tables to utf8mb4/utf8mb4_unicode_ci (you will need to run this for each table):-

ALTER TABLE table_name CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

(Replacing table_name with the name of the database table.)

Finally we need to update the character set and collation for the column(s):-

ALTER TABLE table_name CHANGE column_name column_name VARCHAR(140) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL;

Read that last command carefully as it will need modifying to your specific needs. The important bit to note is the CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci. (As with the other commands you need to replace table_name and column_name appropriately.)

Bear in mind that changing from utf8 to utfmb4 affects the maximum lengths of columns and index keys. The maximum lengths of columns will be unchanged when modifying the tables with the commands discussed here; but in terms of actual characters the amount that can be stored is reduced. This is a consequence of changing from being able to store only 3 bytes up to 4! You may need to update these lengths if this is relevant to you.

It may be a good idea to make sure every table that has been converted is repaired and optimised after making the above changes. You can run the following for each modified table:-

REPAIR TABLE table_name;
OPTIMIZE TABLE table_name;

Alternatively, use mysqlcheck to repair and optimises everything in one go:-

$ mysqlcheck -u root -p --auto-repair --optimize --all-databases

Finally

Everything should be ready now, but don’t forget to update how your app connects to the database so that it uses the new encoding. For example, in CakePHP 3 the datasource’s encoding needs updating to utf8mb4:-

'Datasources' => [
    'default' => [
        ...
        'database' => 'database_name',
        'encoding' => 'utf8mb4',
        ...
    ]
]

We should now be good to go and start saving tweets containing emoticons (or anything else using them).

I hope that this has been informative and useful to you.

Comments

Cool, thanks for this info! It’s amazing that you’d totally think something called ‘utf8’ would totally work … though I guess MySql is sometimes quite special. :)

Andrew Chilton – 16 Jun 2015 9:41am

This was a super valuable post.
A little grammar fix:
“Bare in mind” should be “bear in mind”.

James – 19 Jun 2015 1:27am

@James, thanks. I always get ‘bear’ and ‘bare’ muddled up. Glad the post was useful.

Andy – 20 Jun 2015 9:18am

One other edge-case thing to mention. I found your blog after googling around this issue and I know you work for Evoluted, who actually designed the original version of the app I now develop for and manage.
Our app uses some filter methods for converting CakePHP 2.x’s named parameters into an array of conditions which is then used in $model->find() calls. Named parameters are ASCII-encoded by Cake internally, so they can’t be used in a database call without converting to UTF-8.
I wrote a method to be called in AppController::beforeFilter() to convert all named params to UTF-8 so they can be used in this scenario. Hope that’s useful to someone else.
This doesn’t apply to Cake 3 because it doesn’t have named parameters, but i guess the same could apply to query strings.

    protected function convertNamedParamsToUtf8()
    {
        foreach ($this->params->params[“named”] as $k => &$namedParam) {
            $this->params->params[“named”][$k] = mb_convert_encoding($namedParam, “UTF-8”, “ASCII”);
        }
    }
        

Gareth – 18 Sep 2015 12:54pm

hi andy, i stuck for 2 days because of this charset.
i set
html5 charset=utf8
mysql charset to utfmb4
mysql collate to utfmb4_unicode_ci

but still didnt work
but your last statement to change framework db charset, make my prob solved. I used Codeigniter, so the setting in database.php is like this:

$db[‘default’] = array(
…. ‘char_set’ => ‘utf8mb4’, ‘dbcollat’ => ‘utf8mb4_unicode_ci’,
….
);

thanks for your article, very helpfull

plonknimbuzz – 27 Feb 2017 10:37am

You saved me!! Thanks :D

Karen – 18 Jun 2017 3:38am

This is cool to save the emoji in mysql database but when i fetch the record and want to edit it then it return the image file my field is (blob) in mysql.

the how it is return the unicode of that emoji.

i have converted it in like this

meme.Text =Encoding.UTF8.GetString((byte[])reader[“text”]);

but it not works…

Rajat jain – 4 Aug 2017 7:50am

Why does the entire database need to be changed? Why can’t I just change the collation for the field in question. (Note: I tried doing this and it didn’t work, so I accept that I can’t. I’m just trying to understand why.)

Robert – 7 Aug 2017 6:32pm

Comments are now closed.