Saving Emoticons/Unicode from Twitter to a MySQL Database

The last couple of days I’ve been developing a little app that imports Twitter feeds and saves tweets to a MySQL database; however, I quickly hit an issue where certain tweets would not save. MySQL was returning the error message ‘SQLSTATE[HY000]: General error: 1366 Incorrect string value’. Looking at the specific tweets causing the database to throw the error I noticed that they contained emoticons.

The Problem

I was using utf8 encoding in MySQL which I thought (wrongly) could represent all unicode characters.

The unicodes for emoticons are fully supported by the UTF-8 encoding; however, MySQL’s utf8 does not! To save emoticons to a MySQL database we need to use utf8mb4.

The difference between MySQL’s utf8 and utf8mb4 is that the former can only store 3 byte characters whereas the latter can store 4 byte ones. Therefore with utf8 we can only store unicode characters from the Basic Multilingual Plane. Put more simply, utf8 is suitable for characters from the majority of modern languages and some symbols. Emoticon characters exist in the Supplementary Multilingual Plane for which we need to use utf8mb4.

The Fix

In order to save emoticons and other unicode not supported by MySQL’s utf8 we need to change the character set and collation properties of our database, tables and columns. As long as our database is currently using utf8 there should be no risk of data loss as utf8mb4 is fully backwards compatible with it. However, it is still a good idea to make a backup of the database before running the following commands.

Let’s first set the connection encoding to utf8mb4 and then change the database’s character set and collation to it:-

SET NAMES utf8mb4;
ALTER DATABASE database_name CHARACTER SET = utf8mb4 COLLATE = utf8mb4_unicode_ci;

(Replacing database_name with the name of the database.)

Next we need to convert the relevant tables to utf8mb4/utf8mb4_unicode_ci (you will need to run this for each table):-

ALTER TABLE table_name CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

(Replacing table_name with the name of the database table.)

Finally we need to update the character set and collation for the column(s):-

ALTER TABLE table_name CHANGE column_name column_name VARCHAR(140) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL;

Read that last command carefully as it will need modifying to your specific needs. The important bit to note is the CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci. (As with the other commands you need to replace table_name and column_name appropriately.)

Bear in mind that changing from utf8 to utfmb4 affects the maximum lengths of columns and index keys. The maximum lengths of columns will be unchanged when modifying the tables with the commands discussed here; but in terms of actual characters the amount that can be stored is reduced. This is a consequence of changing from being able to store only 3 bytes up to 4! You may need to update these lengths if this is relevant to you.

It may be a good idea to make sure every table that has been converted is repaired and optimised after making the above changes. You can run the following for each modified table:-

REPAIR TABLE table_name;
OPTIMIZE TABLE table_name;

Alternatively, use mysqlcheck to repair and optimises everything in one go:-

$ mysqlcheck -u root -p --auto-repair --optimize --all-databases

Finally

Everything should be ready now, but don’t forget to update how your app connects to the database so that it uses the new encoding. For example, in CakePHP 3 the datasource’s encoding needs updating to utf8mb4:-

'Datasources' => [
    'default' => [
        ...
        'database' => 'database_name',
        'encoding' => 'utf8mb4',
        ...
    ]
]

We should now be good to go and start saving tweets containing emoticons (or anything else using them).

I hope that this has been informative and useful to you.

Related Content

Published on


Comments

  1. Andrew Chilton |

    Cool, thanks for this info! It’s amazing that you’d totally think something called ‘utf8’ would totally work … though I guess MySql is sometimes quite special. :)

  2. James |

    This was a super valuable post.
    A little grammar fix:
    “Bare in mind” should be “bear in mind”.

  3. Andy |

    @James, thanks. I always get ‘bear’ and ‘bare’ muddled up. Glad the post was useful.

  4. Gareth |

    One other edge-case thing to mention. I found your blog after googling around this issue and I know you work for Evoluted, who actually designed the original version of the app I now develop for and manage.
    Our app uses some filter methods for converting CakePHP 2.x’s named parameters into an array of conditions which is then used in $model->find() calls. Named parameters are ASCII-encoded by Cake internally, so they can’t be used in a database call without converting to UTF-8.
    I wrote a method to be called in AppController::beforeFilter() to convert all named params to UTF-8 so they can be used in this scenario. Hope that’s useful to someone else.
    This doesn’t apply to Cake 3 because it doesn’t have named parameters, but i guess the same could apply to query strings.

        protected function convertNamedParamsToUtf8()
        {
            foreach ($this->params->params[“named”] as $k => &$namedParam) {
                $this->params->params[“named”][$k] = mb_convert_encoding($namedParam, “UTF-8”, “ASCII”);
            }
        }
    	
  5. plonknimbuzz |

    hi andy, i stuck for 2 days because of this charset.
    i set
    html5 charset=utf8
    mysql charset to utfmb4
    mysql collate to utfmb4_unicode_ci

    but still didnt work
    but your last statement to change framework db charset, make my prob solved. I used Codeigniter, so the setting in database.php is like this:

    $db[‘default’] = array(
    …. ‘char_set’ => ‘utf8mb4’, ‘dbcollat’ => ‘utf8mb4_unicode_ci’,
    ….
    );

    thanks for your article, very helpfull

  6. Karen |

    You saved me!! Thanks :D

  7. Rajat jain |

    This is cool to save the emoji in mysql database but when i fetch the record and want to edit it then it return the image file my field is (blob) in mysql.

    the how it is return the unicode of that emoji.

    i have converted it in like this

    meme.Text =Encoding.UTF8.GetString((byte[])reader[“text”]);

    but it not works…

  8. Robert |

    Why does the entire database need to be changed? Why can’t I just change the collation for the field in question. (Note: I tried doing this and it didn’t work, so I accept that I can’t. I’m just trying to understand why.)

Leave a Comment
  • You will need to preview your comment before you can submit it.