Fixing the dirify function in Movable Type

The dirify function Movable Type uses to turn post titles into legal directory names suitable for URLs has serious problems with some accented characters. I could not find a satisfactory fix anywhere, so I hacked up my own. While I was at it, I changed it to use hyphens instead of underscores to separate words in URLs.

I’ve seen the accented character bug before, but haven’t bothered to look for a solution since nearly all posts on 456 Berea Street are in English. However, when I was working on another site recently, I really needed to fix this so that Swedish post titles would turn into reasonably readable URL fragments.

All I wanted dirify to do was convert any accented characters to their non-accented versions:

  1. å => a
  2. ä => a
  3. ö => o
  4. Å => a
  5. Ä => a
  6. Ö => o

Seems simple enough, but that was not what happened. Instead, Movable Type for some reason converts the characters this way:

  1. å =>
  2. ä => ae
  3. ö => oe
  4. Å =>
  5. Ä => ae
  6. Ö => oe

That’s right. The letters “å” and “Å” are simply removed. Hey! They may look odd to the rest of the world but we use them a lot here! Replacing “ä” with “ae” and “ö” with “oe” is not what I want either.
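To make the difference concrete, here is a minimal JavaScript sketch of the two mappings. It is not Movable Type’s code (the real table is Perl, in Util.pm). The names `wantedTable`, `observedTable` and `transliterate` are made up for illustration, and modelling the vanishing “å” as a missing table entry is only an assumption about why it disappears.

```javascript
// Illustrative tables only; these names do not exist in Movable Type.
const wantedTable = { 'å': 'a', 'ä': 'a', 'ö': 'o' };
// Assumption: "å" has no entry at all, so it is dropped with other unknown characters.
const observedTable = { 'ä': 'ae', 'ö': 'oe' };

// Lowercase the title, then map each character through the table.
// Characters without a table entry survive only if they are a-z, 0-9 or a space.
function transliterate(title, table) {
  let out = '';
  for (const ch of title.toLowerCase()) {
    if (Object.prototype.hasOwnProperty.call(table, ch)) {
      out += table[ch];
    } else if (/[a-z0-9 ]/.test(ch)) {
      out += ch;
    }
  }
  return out;
}
```

With these tables, “Räksmörgås” comes out as raksmorgas using the wanted mapping and as raeksmoergs using the observed one.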

Since I also wanted to use hyphens instead of underscores to separate words in URLs I started looking for plugins that could help out with that, hoping that I might also stumble upon a solution to this character conversion problem. I found a few options: Dashify, Dirifyplus, and Dirify for Unicode. None of them fixed the problem. Dirifyplus (I think) actually made it even worse by converting all accented characters to “a”.

So in the end I decided to find Movable Type’s dirify function and fix it. After a bit of searching I found it in /lib/MT/Util.pm. The separator character is defined on line 544, and the conversion table starts on line 620 (assuming Movable Type 3.2 set up to use UTF-8).

To use dashes instead of underscores, just edit line 544. To make dirify convert Swedish accented characters to something more usable, replace the my %utf8_table hash table with this patched utf8_table.

Happy that I had managed to solve the problem, I went to apply the fix to my other site, Kaffesnobben. Well, that revealed other problems since that site is running Movable Type 3.17. First, the separator character is specified on line 457 instead of 544, and the my %utf8_table hash table starts on line 528 instead of 620. No big deal, it should work anyway, right? Wrong. After applying the patches the Swedish characters were converted properly, but underscores were still being used as separators.

After spending way too long trying to figure out why dirify wouldn’t use hyphens instead of underscores I finally found the answer: I had installed patch-20050616-utf8dirify-nodash.pl, a plugin that fixes another dirify problem. Well, that plugin also overrides the separator character specified in Util.pm. So if you have this plugin installed, make sure to edit the separator character on line 20.

It took a while, but in the end I found a solution to my problem. Hopefully this post will save someone else a bit of frustration.

Posted on October 17, 2005 in Movable Type

Comments

  1. Why they still use the underscore boggles me, since it’s been known for a long time that hyphens work much better.

  2. Ohh.. I forgot to say, seeing it from the SEO perspective…

  3. You really should have used serendipity :-)

  4. And ä to ae and ö to oe and so on seems to be the German “translation” of these Umlauts. Sorry for making this two posts!

  5. October 17, 2005 by Roger Johansson (Author comment)

    Kim: Changing from underscores to hyphens as URL separators between versions would cause major problems for people upgrading ;-). They should make it an option though, especially for new sites.

    Jannis: I have looked at other systems but decided not to spend time learning something new right now. After all MT does work well enough for me, and I think most people coming here would rather have me focus on writing new articles ;-).

  6. In Swedish I would think it more historically correct to convert the characters like this:

    ö = oe
    ä = ae
    å = aa

    I even think, if I remember correctly, that ö started out as an o with a small e hovering above it. This e then slowly transformed into the two dots.

  7. October 17, 2005 by Isaac Lin

    I wrote my own version of dirify that, in addition to using hyphens, eliminated the shortest words first when trying to fit under the character limit. If there were any ties, the word that came later in the title was deleted first. It was my quick-and-dirty attempt to try to increase the relevance of the URL for the post.
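The shortening rule from comment 7 could look roughly like the sketch below. This is a guess at the behaviour described, not Isaac’s actual code: `shortenSlug` and `maxLength` are hypothetical names, and what should happen when a single remaining word is still too long is not specified, so the sketch simply leaves it alone.

```javascript
// Drop the shortest word until the hyphenated slug fits within maxLength.
// Among equally short words, the one that comes later in the title goes first.
function shortenSlug(title, maxLength) {
  const words = title.toLowerCase().split(/\s+/).filter(Boolean);
  const slugOf = (list) => list.join('-');

  while (words.length > 1 && slugOf(words).length > maxLength) {
    let drop = 0;
    for (let i = 1; i < words.length; i++) {
      // "<=" makes a later word win a tie on length.
      if (words[i].length <= words[drop].length) {
        drop = i;
      }
    }
    words.splice(drop, 1);
  }
  return slugOf(words);
}
```

For example, with a limit of 20 characters, “The quick brown fox jumps over the lazy dog” is reduced to quick-brown-jumps.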

  8. Looking at my own dirify function in our CMS, I wasn’t sure whether I turn an ø into oe or o, but it seems I have used oe from the beginning, :D

    Example : http://www.easycms.no/easy-cms-publiseringsverktoey.html

    I pinned the oe correctly; however, it seems I totally whacked the å: http://www.easycms.no/oppdater-nettsidene-s-ofte-du-vil.html

    Something tells me coffee isn’t always a necessary evil, :D From an SEO perspective I could argue that the Å or aa characters don’t have any practical meaning for a search (which is probably why I dropped them, now that I think of it), then again why did I leave the s then… I need to rewrite my routine, I see now, :)

    I remember starting with underscores myself. Like so many others, I guessed that an underscore was treated as a space by everyone, but I didn’t do any research at the time into whether it actually mattered to the search engines. To the naked eye it really doesn’t matter that much; however, as SEO shows, they are worlds apart.

    But you are right about moving from underscores to hyphens: especially if you are well indexed, it could be a dangerous road. You would need to keep rewriting all underscores to hyphens, check that the page exists, and then 301 over to the hyphen version. I did this, but I did it manually by adding a line in the .htaccess for each page that showed up in my log. A time-consuming job really, but it worked nicely.

    Something like this: RewriteRule ^some_page_title.php http://www.domain.no/some-page-title.php [R=301,L]

    For those who don’t know this, the above line tells the browser to redirect the _ URL to the - URL. A search engine will also be informed that the old URL, which is already indexed, should be removed and replaced with the new one. Meanwhile, both URLs work, a perfect solution. Especially as you’ll get a boost in rankings as the new URLs are visited.

    Another look at the SEO side of things: converting ø to oe seems to be a far better alternative. I did a quick search on Google which had some interesting results.

    Search for: sokemotor, 41900 hits
    Search for: soekemotor, 535000 hits

    The interesting part is that Google highlights words written with ø as hits… Looks like there is some dirify running at Google as well. However interesting this might look, I couldn’t get any results matching oe by searching for an ø, :/

    It sounds more logical to me to turn an ø into oe and an å into aa, and since it doesn’t look like it has any impact on the search engine one might as well go with the one that looks most logical to the human user. After all, we tell our customers that the dirify is called “Human readable URLs”, :D

    As a final note on the dirify, a crafty SEO function I’ve been thinking of for my dirify, to spice things up a bit, would be the option to choose between hyphens and plus signs as word delimiters when creating the article. Sometimes a plus sign can be a killer on search terms; then again, explaining the correct usage to the customers could probably be the same, :D
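Adding one redirect line per page by hand, as comment 8 describes, is the kind of job a small script can take over. The sketch below builds such lines from old underscore paths. `redirectRule` and `escapeRegex` are hypothetical helpers, the paths are assumed to contain no spaces, and unlike the line quoted in the comment the generated pattern escapes the dot and ends with `$` so that it matches only the exact file name.

```javascript
// Escape characters that have a special meaning in a regular expression.
function escapeRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Build one mod_rewrite line that permanently redirects an old
// underscore path to the same path written with hyphens.
function redirectRule(oldPath, baseUrl) {
  const newPath = oldPath.replace(/_/g, '-');
  return `RewriteRule ^${escapeRegex(oldPath)}$ ${baseUrl}/${newPath} [R=301,L]`;
}

// Example input: a hypothetical list of old paths, e.g. collected from a server log.
const oldPaths = ['some_page_title.php'];
const rules = oldPaths.map((path) => redirectRule(path, 'http://www.domain.no'));
```

Each entry in `rules` is a line that could be pasted into the .htaccess file, after checking that the hyphenated page really exists.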

  9. October 18, 2005 by Roger Johansson (Author comment)

    Danne: Yes, that conversion is probably more historically correct, but I think it looks really silly and makes URLs harder to read.

    Kim: I did similar search tests with opposite results, so that is another reason I do not want to use “aa”, “ae” and “oe”.

    I’d like to switch this site over to using hyphens instead of underscores, but I do not want to risk incoming links breaking…

  10. Kim: Sorry to take the conversation back to the first comment but…

    I didn’t know it was a fact that hyphens were the preferred method of separating words in a URL. I’ve always found underscores to be easier to read. Do search engines definitely prefer hyphens?

    Could you point me to some more information on this (if you know of any), as I’ve been searching for an expert’s opinion on the pros and cons of hyphens and underscores?

  11. October 18, 2005 by Roger Johansson (Author comment)

  12. Thanks Roger. I’ve got to go make some changes to my URLs!

  13. October 21, 2005 by Joerg Petermann

    Special thanks for your tip Roger, it helps to find a solution for me too.

  14. Use these for Aelig and aelig (Æ and æ):

    "\xc3\x86" => 'A',    # Aelig
    "\xc3\xa6" => 'a',    # aelig
    
  15. While I’m at it… If you want this to work from within MT as well, you need to edit tmpl/cms/edit_entry.tmpl

    I suggest you copy it to your exttmpl-directory (specified in mt-config.cgi). Then look for the JavaScript function dirify.

    Change s = s.replace(/\s+/g, '_'); to s = s.replace(/\s+/g, '-'); to go from underscores to hyphens. Add whatever other conversions you want immediately after var s = str.toLowerCase();, e.g. for Æ and æ:

    s = s.replace(/Æ/g, 'a');
    s = s.replace(/æ/g, 'a');
    

    The dirify function is in mt-static/mt.js too, so you should probably apply your changes there as well (weird that they replicate the JavaScript function inline in edit_entry.tmpl).

    (Roger, I wanted to use Markdown’s code span, but your CSS doesn’t support it? Feel free to edit/format this entry for legibility).
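Put together, the changes from comment 15 amount to something like the following. This is a simplified stand-in, not the function shipped in mt.js: `dirifySketch` is a hypothetical name, and the step that strips leftover characters is an assumption about what a dirify routine usually does. Only the `toLowerCase()` call and the whitespace replacement come from the comment.

```javascript
// Simplified, hypothetical stand-in for a client-side dirify.
function dirifySketch(str) {
  let s = str.toLowerCase();

  // Custom conversions go right after lowercasing.
  s = s.replace(/[åä]/g, 'a');
  s = s.replace(/ö/g, 'o');
  s = s.replace(/æ/g, 'a');

  // Assumption: anything that is not a letter, digit, whitespace or hyphen is dropped.
  s = s.replace(/[^a-z0-9\s-]/g, '');

  // Trim the ends, then join the words with hyphens instead of underscores.
  s = s.trim();
  s = s.replace(/\s+/g, '-');
  return s;
}
```

Because the string is lowercased first, a pattern such as /Æ/ placed after that line never matches, so only the lowercase forms need a rule. With this sketch, “Kaffe på Södermalm!” becomes kaffe-pa-sodermalm.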

  16. October 21, 2005 by Roger Johansson (Author comment)

    Pål: Thanks, I’ll take a look at that.

    I know about the Markdown problem. I just can’t make the backticks for code replacement work in comments. It works fine when I use the Markdown plugin for BBEdit though. Weird, and quite frustrating.

  17. Thank you for this entry. You just made my life a lot easier and less frustrating.
