Going from iso-8859-1 to utf-8

I’d like to start using utf-8 for character encoding. This whole site is currently in iso-8859-1. Each time I try using utf-8, I run into problems with characters not showing up, or showing up as a garbled mess.

I have tried RTFM, Google, MT support forums, screaming, banging my head against the wall, and taking a walk. None of that helped. I’m hoping someone else has gone from iso-8859-1 to utf-8 on a Movable Type setup before me, and can suggest the best way of converting it to utf-8.

Here’s my setup, assuming that’s important:

  • Movable Type 2.65
  • MySQL
  • All previous entries are iso-8859-1
  • Most posts are in English, but some are in Swedish and contain accented characters. Well, English does too, so that shouldn’t matter. But I’m mentioning it because the posts that get messed up are the ones that are in Swedish.
  • Documents are XHTML 1.0 Strict, and served as application/xhtml+xml to browsers that support it. They get converted to HTML 4.01 Strict and are served as text/html to others.
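The content negotiation described in the last item could look roughly like the sketch below. It is not the script used on this site: the function name `choose_content_type` is hypothetical, and the Accept check is deliberately simplified (it ignores q-values). The point is only that the charset parameter goes into the Content-Type header whichever media type is chosen.

```python
def choose_content_type(accept_header: str, charset: str = "utf-8") -> str:
    """Pick a Content-Type value from a (simplified) Accept header check.

    Hypothetical helper: it only looks for an explicit mention of the
    XHTML media type and ignores q-values entirely.
    """
    media_types = [
        part.split(";")[0].strip().lower()
        for part in accept_header.split(",")
    ]
    if "application/xhtml+xml" in media_types:
        return f"application/xhtml+xml; charset={charset}"
    # Everything else gets the HTML version.
    return f"text/html; charset={charset}"
```

The charset named in this header has to match the bytes actually sent; declaring utf-8 while the pages are still stored as iso-8859-1 produces exactly the kind of garbled characters described above.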

This is what I did last time I tried this:

  • Exported my entries from MT
  • Opened the export file, saved as utf-8.
  • Opened power-editing mode, deleted all entries (which took forever)
  • Reimported everything
  • Converted all templates to utf-8
  • Changed the content negotiation script to send utf-8 in the Content-Type header
  • Opened mt.cfg, changed PublishCharset Shift_JIS to PublishCharset utf-8, and uncommented NoHTMLEntities 1
  • Rebuilt everything
  • Screamed loudly
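The "saved as utf-8" step is where an editor can silently guess the wrong source encoding. A minimal sketch of doing that conversion explicitly, assuming the export file really is iso-8859-1 throughout (the function name and file paths are hypothetical):

```python
from pathlib import Path


def convert_latin1_to_utf8(src: Path, dst: Path) -> None:
    """Re-encode a file from ISO-8859-1 to UTF-8.

    Works on raw bytes so that no platform default encoding or
    newline translation gets involved.
    """
    text = src.read_bytes().decode("iso-8859-1")
    dst.write_bytes(text.encode("utf-8"))
```

Usage would be something like `convert_latin1_to_utf8(Path("export.txt"), Path("export-utf8.txt"))`. Running it on a file that is already utf-8 would double-encode the accented characters, so it should be applied exactly once.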

What am I doing wrong?

Posted on September 4, 2004 in Movable Type

Comments

  1. Just try using iconv before displaying the page.

  2. September 4, 2004 by caffènero

    I have similar problems, Roger. In my specific case, the dumb IE refuses to display UTF-8 “special” chars (arrows and such), while FF and all the other browsers correctly display them. With the SAME font (Verdana), on the SAME platform (WinXP). I know this doesn’t solve your issue, but it may warn you about future ones. :(

    Seems UTF-8 isn’t well supported yet, no matter how you force Apache, editors and browsers to use the right encoding…

  3. Open up mt.cfg, and there’s a line for setting the default character set for all your blogs. Mine currently looks like:

    PublishCharset UTF-8

    I assume MT uses that to output entries in that character set. I’ve never checked it, but I haven’t found any validation problems related to characters either.

    Here’s a suggestion (if that doesn’t work for your past entries). Export all entries, then do a search-n-replace for the several problematic Swedish characters to change them into their numeric entity equivalents. Then import the entries. If you do that, and then have the PublishCharset value as UTF-8, you should have no problems with future entries and the past ones will work without you having to go through and edit them one by one.
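The search-and-replace idea in the comment above can be sketched in a few lines, assuming the export text has already been decoded correctly into a string. The function name is hypothetical; the `xmlcharrefreplace` error handler is part of Python's standard codecs.

```python
def to_numeric_entities(text: str) -> str:
    """Replace every non-ASCII character with a decimal numeric entity."""
    # Encoding to ASCII with this handler turns e.g. "ä" into "&#228;".
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")
```

The result is pure ASCII, so it reads the same whether the page is served as iso-8859-1 or utf-8. Existing ampersands and entities are left untouched.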

  4. You have to make a dump of your database and convert it to UTF-8 with a capable text editor, then push it back.

    “Straight-out” conversion is only possible when you have plain ASCII text (not your case).

  5. September 5, 2004 by Roger Johansson (Author comment)

    Devon: I tried changing that line in mt.cfg. Didn’t help. Doing a search and replace sounds like it’s worth a try.

    Julik: That’s what I did, if you mean exporting the database from within MT.

  6. I’m not familiar enough with Movable Type to be sure this is relevant, but it may also be worth making sure that your installation of MySQL is version 4. Until version 4, MySQL tables couldn’t store UTF8 data properly.

  7. September 6, 2004 by Björn

    Why do you want to change encoding? There must be a good reason but I can’t see it (not too good at the encoding stuff though :).

  8. September 6, 2004 by Roger Johansson (Author comment)

    Andrew: Thanks for the tip. The server is running MySQL 4.something, so it should be able to handle Unicode.

    Björn: Partly because it’s what I “should” use, partly because I want to learn from it.

  9. Roger,

    I’m not sure of the validity of this article (I haven’t tried it myself), but I found an entry on someone’s blog stating you need to change the send_http_header routine in App.pm. It also mentions something about Apache configuration, although I’m assuming you’ve already covered that one off.

    Let us know how you get on. I’m thinking about going utf-8 on MT when I next upgrade/redesign…

  10. September 8, 2004 by Roger Johansson (Author comment)

    I still haven’t revisited the utf-8 problem. It looks like upgrading to MT 3 could help. I’ll look into that, as well as take a closer look at other systems, when I have time.

  11. November 16, 2004 by Chvora

    I am trying to email utf8 and it gives me a headache.

    I believe I had the same problem displaying utf8 data on the website. FF will display it properly but IE won’t. AFAIK, if the HTML template has the charset hardcoded, you might have to change it to utf8. FF for some reason correctly auto-switches to utf while IE does what it’s told and displays it as iso.

  12. Have you checked the settings for your server? By default, Apache 2.0 overrides the charset to iso-8859-1.

  13. FWIW, I’ve documented how I converted my MT blog from ISO-8859-1 to UTF-8.

    http://padawan.info/weblog/convertingamovabletypeblogfromiso88591toutf8.html

    The trick is to make sure that ALL the elements of the chain, from the content to the web server all inclusive are in UTF-8. It’s not a matter of just changing one setting in MT or Apache or MySQL, it must be consistent all the way through, starting with the content.
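One way to test the "consistent all the way through" point on a single response is to compare what the Content-Type header declares with what the body bytes actually are. This is a rough sketch with hypothetical helper names, not a full HTTP header parser:

```python
from typing import Optional


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def declared_charset(content_type: str) -> Optional[str]:
    """Extract the charset parameter from a Content-Type value, if present."""
    for param in content_type.split(";")[1:]:
        name, _, value = param.strip().partition("=")
        if name.lower() == "charset":
            return value.strip().strip('"').lower()
    return None


def chain_is_consistent(content_type: str, body: bytes) -> bool:
    """Both the declaration and the actual bytes must be UTF-8."""
    return declared_charset(content_type) == "utf-8" and is_valid_utf8(body)
```

A passing check only proves the bytes are well-formed UTF-8. Text that was double-encoded somewhere earlier in the chain is still valid UTF-8 and would slip through.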

  14. Oops, Markdown messes with your autolink feature. The link above is:

    Converting a Movable Type blog from ISO-8859-1 to UTF-8

  15. December 29, 2004 by Roger Johansson (Author comment)

    padawan: Thanks, you seem to have run into much the same problems I did. My problems were probably caused by not converting the database plus using an ftp application that somehow screwed up character encoding during transfer. I’ll keep your article handy next time I give this a try.
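A typical form of the "garbled mess" is double encoding: UTF-8 bytes get read as iso-8859-1 somewhere along the way (by a database, an editor, or a transfer tool) and are then encoded to UTF-8 again, so "ä" turns into "Ã¤". Whether that is what happened here is an assumption. If it is, the damage can often be undone as sketched below (the function name is hypothetical):

```python
def repair_double_encoding(text: str) -> str:
    """Try to undo UTF-8 text that was mis-decoded as ISO-8859-1.

    If the round trip fails, the text was probably not double-encoded,
    so it is returned unchanged.
    """
    try:
        return text.encode("iso-8859-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text
```

This is a heuristic. Tools that treat iso-8859-1 as Windows-1252 produce slightly different garbage, which this sketch does not handle.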
