My Kingdom for an Umlaut

Community, Project News, Pulse March 12th, 2008

posted by Jeff Standen

I think we’ve solved most of the issues with special characters (such as umlauts) in Western, single-byte languages for Cerb4.

Cerb4 Build: 563

Due to a few lingering obstacles for full internationalization, we’re still using ISO-8859-1 encoding. That should completely cover English, German, Spanish, Italian, Norwegian, Icelandic, Swedish and others. It should mostly cover Dutch and French (with the exception of a few rare characters).

Warning! Dear Non-Techie friends, the rest of this post gets a bit nerdy.

The recent issues were caused by a couple of things:

  • META/Content-Type: We were a bit over-eager and had the browser using Unicode (UTF-8) encoding while the data was still Western Latin (ISO-8859-1). Display-wise this is entirely fine because of backwards compatibility, but when input such as form data was sent from the browser to PHP, it could include multi-byte characters that the database would split up into seemingly random single-byte characters. This affected things like umlauts and the British pound sign, not just languages like Japanese, Chinese and Russian, whose characters are always multi-byte in UTF-8.
  • Ajax/XHR: Modern browsers default their XHR requests (how Ajax happens) to UTF-8 encoding when no Content-Type header is provided. This is why even with ISO-8859-1 encoding, umlauts would break on dynamic functionality in the helpdesk (ticket peek, reply, templates).
  • UTF-8 vs Unicode: Unicode is a character set whose first 256 code points match ISO-8859-1 exactly. UTF-8 is a variable-length encoding of Unicode that matches ISO-8859-1 only over the first 7 bits (which is basically all the characters printed on the keys of a standard US keyboard, including the SHIFT characters). Extended characters, such as umlauts, take two bytes in UTF-8 where ISO-8859-1 uses one (ü is 0xC3 0xBC vs 0xFC).
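
To make the byte-level difference concrete, here is a small sketch. It is written in Python rather than the PHP that Cerb4 runs on, and the variable names are made up for illustration, so treat it as a demonstration of the encodings and not as Cerb4 code. It prints the two encodings of "ü" and shows what comes out when UTF-8 bytes are read back as ISO-8859-1.

```python
# Illustrative only: Python stands in for the browser/PHP/database layers.
u_umlaut = "\u00fc"  # the character ü, Unicode code point U+00FC

latin1_bytes = u_umlaut.encode("iso-8859-1")  # one byte
utf8_bytes = u_umlaut.encode("utf-8")         # two bytes

print(latin1_bytes.hex())  # fc
print(utf8_bytes.hex())    # c3bc

# A browser that believes the page is UTF-8 submits the two-byte form.
# A Latin-1 store then treats each byte as a separate character.
misread = utf8_bytes.decode("iso-8859-1")
print(misread)  # Ã¼

# Plain ASCII text is byte-for-byte identical in both encodings.
assert "Cerb4".encode("utf-8") == "Cerb4".encode("iso-8859-1")
```

The "Ã¼" pair is the typical symptom of this kind of mismatch: one character went in, two unrelated ones came out.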

We’d love to switch over to UTF-8 immediately and support almost every language, but here’s what’s in the way:

  • PHP6 is on the way with native Unicode support for all strings. This is much cleaner than anything we can do with the mbstring extension.
  • mbstring function overloading (e.g. substr() -> mb_substr()) can simulate native UTF-8 strings with PHP5, but it requires php.ini changes. These changes will usually affect all PHP scripts on your webserver. Under ideal conditions we can control this with Cerb4 and only affect our script. However, real-world conditions are rarely ideal and this would make for a terrible minimum requirement.
  • Our database abstraction (MySQL, PostgreSQL, etc.) makes UTF-8 support in the database layer more complicated. This would be much easier if we knew we were always dealing with MySQL 4.1+. We may need to make some tough decisions here (though they’d allow us a lot more efficiency on things like fulltext index searching versus our current watered-down alternatives).
  • Converting existing databases to UTF-8 isn’t a one-click process. Ideally we’ll only have to help people convert their existing database if they need UTF-8 support. It can also be the new default. Our current goal is to make sure we can support UTF-8 optionally without breaking every existing Cerb4 helpdesk to get it. Data-wise we’re probably fine here, but we’d run into the same issues as described above until the database was converted to UTF-8. The MySQL default is ISO-8859-1 (which it dubs “latin1”).
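
The gap between byte-oriented and character-oriented string functions (substr() versus mb_substr() in PHP) can be sketched as follows. This is a Python analogue and not Cerb4 code; `byte_substr` and `char_substr` are hypothetical helper names invented for the illustration.

```python
def byte_substr(data: bytes, start: int, length: int) -> bytes:
    """Slice by bytes, the way a byte-oriented substr() would."""
    return data[start:start + length]


def char_substr(data: bytes, start: int, length: int,
                encoding: str = "utf-8") -> bytes:
    """Slice by characters: decode first, slice, then re-encode."""
    text = data.decode(encoding)
    return text[start:start + length].encode(encoding)


subject = "Grüße".encode("utf-8")  # 5 characters, 7 bytes in UTF-8

cut_bytes = byte_substr(subject, 0, 3)  # stops in the middle of "ü"
cut_chars = char_substr(subject, 0, 3)  # keeps "ü" whole

print(len(subject))                                  # 7
print(cut_bytes.decode("utf-8", errors="replace"))   # "Gr" plus U+FFFD
print(cut_chars.decode("utf-8"))                     # Grü
```

The byte-oriented cut leaves half of a two-byte character behind, which is why every string function that counts or slices has to become encoding-aware before UTF-8 data is safe.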

UTF-8 hangups aside, this should fix the main issues with Cerberus Helpdesk for our friends in Germany, Austria, France, Belgium, Italy, Sweden, Norway, Spain and the Netherlands (who have been rightfully banging down our door about it!)



5 Comments to “My Kingdom for an Umlaut”

  1. Cerberus Helpdesk - Blog » Blog Archive » Progress Report: Time Tracking, Support Center, Translations | March 14th, 2008 at 2:10 am

    […] We finally sorted out the Western Latin special character issues that have been plaguing Cerb4.  This was a big step to […]

  2. Mick | March 15th, 2008 at 9:25 am

    Greetings Jeff,

    IIRC, one of the few “rare” characters missing in ISO-8859-1 is the euro symbol (€), which is the currency of most European countries. ISO-8859-15 covers this symbol.

    Just my 2 cents, Mick

  3. Jeff Standen | March 15th, 2008 at 2:55 pm

    Hey Mick!

    Yeah, the main goal is to handle everything in UTF-8 so we won’t have to choose one encoding or another (since many helpdesks span several localizations, and it makes no sense to keep converting the database encoding).

    Thanks for the comment. :)

  4. Christian | March 18th, 2008 at 3:00 pm

    Hi Jeff,

    perfect that you take care of this. thanks. two things:

    1) I think not using UTF-8 is really not the correct way. A solution without UTF-8 is not a solution, it is a bad workaround. PHP can handle UTF-8 better than most people think. We even send all our mails in UTF-8 and there are no problems. And: Chinese etc. is impossible without it.

    2) I updated our system today. öäü in the mail text works. But in the subject it shows ??? as before.

    best regards,
    Christian

  5. Jeff Standen | March 18th, 2008 at 5:40 pm

    Hey Christian!

    1) We’re only using ISO-8859-1 during the transition to UTF-8. We will be doing UTF-8. The main obstacle is just adopting MySQL 4.1+ as a minimum requirement.

    2) I’ll see if Q/A can reproduce that. If you get a chance, send your message source to my e-mail address (jeff AT webgroupmedia DOT com).

    Thanks for your comment! :)
