What the �Ã¡¥! Exporting UTF-8 Data From a latin1 MySQL Table

Have you ever found yourself stuck with gibberish characters in a database that you’ve just imported? Is your data looking like someone’s tried to mask expletives? Then perhaps you’ve found yourself exporting a latin1 encoded database containing UTF-8 characters.

The Problem

Every now and then I find myself needing to migrate a database from one place to another; usually when preparing to replace an old site with a new one. Unfortunately this can sometimes throw up some encoding issues. An apparently working original seems broken with unexpected characters like “�Ã¡¥” in the copy.

The default MySQL character set and collation are latin1 and latin1_swedish_ci (the Finnish creator of MySQL, Michael Widenius, was co-head of a Swedish firm). As a result there are many existing databases out there using latin1. If someone then inserts UTF-8 data into one of these everything can look fine until it comes to importing this into a new database.

One way you can check if this is your problem is to query the imported database by setting the CHARACTER_SET_RESULTS from within the mysql client (via the terminal rather than something like PhpMyAdmin).

Try querying your database for some of the broken data. Then set CHARACTER_SET_RESULTS before running the query again.

SET SESSION CHARACTER_SET_RESULTS = latin1;

If the data shows correctly the second time then it looks like the problem being described here is your issue too.

The Fix

Thankfully we can resolve the issue by taking a few short steps. We want to export the structure and data of the existing database separately; change the exported schema’s character set; and then import the tweaked structure and data into the new database.

One thing to note before we proceed with the fix is that MySQL’s utf8 encoding does not support all unicode characters. For true UTF-8 support we want to use MySQL’s utf8mb4 character set. For example, we’d need to use utf8mb4 if we wanted to store the increasinly popular emoji characters correctly in our database. For more on this check out my article on Saving Emoticons/Unicode from Twitter to a MySQL Database.

I’m going to assume that the database we’re importing into is utf8mb4. If you’ve not already created this database then set it up now.

CREATE DATABASE `[new_database]` DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

So let’s look at solving our issue with our latin1 tables. To start with we want to export the database’s schema using mysqldump and change its character set to utf8mb4.

mysqldump -u[username] -p [existing_database] --no-data --skip-set-charset --default-character-set=latin1 \
| sed 's/CHARSET=latin1/CHARSET=utf8mb4/g' \
> schema.sql

There are three parts to this command: we use mysqldump to export the data; this is then piped to sed which updates the charset in the export; then finally we write the output to a file named schema.sql.

We’re using --no-data to only export the schema (this bit is important).
We’re also using --skip-set-charset and --default-character-set=latin1 to ensure that MySQL doesn’t attempt to reconvert or set the charset.
sed is replacing every occurence of latin1 for the charset with utf8mb4. Change utf8mb4 to utf8 if you don’t need the full UTF-8 character set.

Next we want to export the data in our database.

mysqldump -u[username] -p [existing_database] --no-create-db --no-create-info --skip-set-charset --default-character-set=latin1 > data.sql

There are just two parts to this command: again we use mysqldump but this time just to export the data and then we write this output to data.sql.

We use --no-create-db to exclude the CREATE DATABASE statement.
We also use --no-create-info to exclude any CREATE TABLE statements as our previous export is handling the structure.

Now it’s time to import the exported schema and data to our new UTF-8 database. Open a connection to the new database using utf8mb4 (or utf8 if that’s what you are using) as the default character set.

mysql -u[username] -p [new_database] --default-character-set=utf8mb4

Finally, import the schema and data.

source schema.sql;
source data.sql;

Hopefully everything should now be working correctly in the new database and the gibberish characters gone!