Have you ever found yourself stuck with gibberish characters in a database that you’ve just imported? Is your data looking like someone’s tried to mask expletives? Then perhaps you’ve found yourself exporting a
latin1 encoded database containing UTF-8 characters.
Every now and then I find myself needing to migrate a database from one place to another; usually when preparing to replace an old site with a new one. Unfortunately this can sometimes throw up some encoding issues. An apparently working original seems broken with unexpected characters like “�Ã¡¥” in the copy.
The default MySQL character set and collation are
latin1_swedish_ci (the Finnish creator of MySQL, Michael Widenius, was co-head of a Swedish firm). As a result there are many existing databases out there using
latin1. If someone then inserts UTF-8 data into one of these everything can look fine until it comes to importing this into a new database.
One way you can check if this is your problem is to query the imported database by setting the
CHARACTER_SET_RESULTS from within the mysql client (via the terminal rather than something like PhpMyAdmin).
Try querying your database for some of the broken data. Then set
CHARACTER_SET_RESULTS before running the query again.
SET SESSION CHARACTER_SET_RESULTS = latin1;
If the data shows correctly the second time then it looks like the problem being described here is your issue too.
Thankfully we can resolve the issue by taking a few short steps. We want to export the structure and data of the existing database separately; change the exported schema’s character set; and then import the tweaked structure and data into the new database.
One thing to note before we proceed with the fix is that MySQL’s
utf8 encoding does not support all unicode characters. For true UTF-8 support we want to use MySQL’s
utf8mb4 character set. For example, we’d need to use
utf8mb4 if we wanted to store the increasinly popular emoji characters correctly in our database. For more on this check out my article on Saving Emoticons/Unicode from Twitter to a MySQL Database.
I’m going to assume that the database we’re importing into is
utf8mb4. If you’ve not already created this database then set it up now.
CREATE DATABASE `[new_database]` DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
So let’s look at solving our issue with our
latin1 tables. To start with we want to export the database’s schema using
mysqldump and change its character set to
mysqldump -u[username] -p [existing_database] --no-data --skip-set-charset --default-character-set=latin1 \ | sed 's/CHARSET=latin1/CHARSET=utf8mb4/g' \ > schema.sql
There are three parts to this command: we use
mysqldump to export the data; this is then piped to
sed which updates the
charset in the export; then finally we write the output to a file named schema.sql.
- We’re using
--no-datato only export the schema (this bit is important).
- We’re also using
--default-character-set=latin1to ensure that MySQL doesn’t attempt to reconvert or set the charset.
sedis replacing every occurence of
utf8if you don’t need the full UTF-8 character set.
Next we want to export the data in our database.
mysqldump -u[username] -p [existing_database] --no-create-db --no-create-info --skip-set-charset --default-character-set=latin1 > data.sql
There are just two parts to this command: again we use
mysqldump but this time just to export the data and then we write this output to data.sql.
- We use
--no-create-dbto exclude the
- We also use
--no-create-infoto exclude any
CREATE TABLEstatements as our previous export is handling the structure.
Now it’s time to import the exported schema and data to our new UTF-8 database. Open a connection to the new database using
utf8 if that’s what you are using) as the default character set.
mysql -u[username] -p [new_database] --default-character-set=utf8mb4
Finally, import the schema and data.
source schema.sql; source data.sql;
Hopefully everything should now be working correctly in the new database and the gibberish characters gone!