Sneak Peek: UTF-8 / Translations
Community, Sneak Peek October 15th, 2008
posted by Jeff Standen
When I designed the skeleton of Cerb4 about a year ago, I built it with the intention of flipping a switch for UTF-8 support at some point in the future. Full UTF-8 support is still a tricky thing with PHP5 since it deals with everything internally as latin1/iso-8859-1 (Western Latin) encoded. That means most functions aren’t multibyte friendly. On top of that, MySQL also defaults to latin1 (though it’s easy to switch multibyte support on).
We had been planning to wait until PHP6 for multibyte languages since it will finally support them natively, but there has been a UTF-8 patch for Cerb4 floating around the forums which actually works quite well. It cleanly builds off the placeholders I left in the code and mops up some lingering issues.
Earlier this week I had Dan@WGM and Mike@WGM take the patch and put it through the paces. They made some improvements (a simple toggle for latin1 or utf8) and sent the new version to me. I fixed some minor inconsistencies, and a templates issue that was corrupting multibyte display, and merged the patch into the official codebase.
That means with the next stable release we’ll *finally* have UTF-8 support. I wrote up some preliminary instructions for converting an existing database to UTF-8. You can take a peek, but please don’t follow along yet (unless you’re planning on running the daily development builds of Cerb4).
Special thanks to rogerger and LudovicLange from the community forums for their hard work and collaboration! ;) It’s great to have some outside eyeballs on the code.
-Jeff@WGM

(The text shown above was copied for testing purposes from The Kermit Project’s helpful UTF-8 Sampler)

I’m glad that we will have UTF-8 support in the next version. Please let me know if there is something I can help with, either testing or debugging. :)
Nice work on Cerberus!
Good to hear - the missing UTF-8 support in the templates is the showstopper here: Not using a license we paid for is not what we actually want to do!
Best Regards,
Marc
Hey Marc! This weekend I’ve actually been working on a visual translation editor for the ‘Helpdesk Config’ area too. We’ll use the positive mental boost from that to go through the grueling work of externalizing the rest of the text in the templates. ;)
@Rogerer Thanks! :)
If you don’t already have an account at our project portal (http://www.wgmdev.com/jira/) then sign up for one. Send me a private message on the forums with your login and I’ll add you to a special developer group.
@Jeff: I’m with you! ;-)
Hi There
We have just signed up to test whether we could use this program. And since we are located in Denmark, I’m of course very interested in this. Does your site use the patch mentioned?
In production I will run this locally, so it should be possible to apply the fix then?
Have a nice weekend,
Henrik
Hey Henrik!
Our online demo is still using the latest stable version, which doesn’t use the patch yet (we usually don’t want to show off something different than what people will download). We’re planning to move the recent development to stable around Tuesday and we’ll update the demo then too.
With our On-Demand service we give people the option of running stable or development versions.
You could also run the development version locally to test with these instructions:
http://wiki.cerb4.com/wiki/Upgrading_to_Newer_Versions_of_Cerberus_Helpdesk_4.0
Thanks!
I get emails from my German colleagues. Outlook shows each one as 1 email (as expected), but in Cerberus I get 2 or 3 extra blank emails with “winmail.dat” attachments. Will these UTF changes help with that?
Hi,
Has there been any development in this case?
Best regards,
Haluk
@Haluk
UTF-8/Translations support has been added to the stable code branch. If you update to build #783, you should be able to start working with the translation tools.
@Wibbler
Hey there! That has to do with Microsoft’s RFC-defying TNEF format:
http://en.wikipedia.org/wiki/TNEF
http://www.dwheeler.com/essays/microsoft-outlook-tnef.html
I understand that you can’t tell every friend or customer to reconfigure their mail client. Our mail processing libraries just ignore TNEF, but it’s possible we could process them at the Cerb4 level if there’s enough demand. As far as I know there’s surprisingly little demand. ;)
If you don’t mind, and have an innocuous enough example, send the message source over to me at (jeff AT webgroupmedia DOT com). I’ll see what we can do.
The subject in e-mails is not converted.
That is, the ticket subject is OK (in UTF-8), but the e-mail subject as shown in the ticket view is full of ???:
From: Oleg Gawriloff
To: somemail
Subject: ????1
Date: Mon, 10 Nov 2008 11:29:12 +0200
The subject there is “test1”, with the word “test” written in Russian.
@Oleg
Hey there! We’re looking into the issue. We currently handle decoding on the Subject header differently from the rest of the headers (for no good reason), and it should be a simple tweak to fix this.
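For anyone curious what decoding a Subject header involves, here is a rough sketch in Python (not the PHP that Cerb4 itself uses). It assumes the subject arrives as an RFC 2047 encoded-word, which is how mail clients normally send non-ASCII headers. The function name `decode_subject` and the sample subject are made up for illustration.

```python
from email.header import Header, decode_header, make_header


def decode_subject(raw_subject: str) -> str:
    """Decode RFC 2047 encoded-words in a header value into a Unicode string."""
    # decode_header() splits the value into (bytes, charset) parts;
    # make_header() joins them back together as readable text.
    return str(make_header(decode_header(raw_subject)))


# Hypothetical sample: a Russian word followed by "1", encoded the way a
# mail client would put it on the wire (an ASCII-only encoded-word).
wire_subject = Header("тест1", "utf-8").encode()

print(wire_subject)                  # ASCII-only form, as transmitted
print(decode_subject(wire_subject))  # readable Unicode form
```

If the decoding step is skipped or done with the wrong charset, the non-ASCII letters are what end up displayed as question marks.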
Would be really nice to have Subject: line support for us utf8-crippled Europeans. :-)
One more thing: on http://wiki.cerb4.com/wiki/UTF-8, you suggest running the command “iconv -c -f utf8 -t utf8”. Do you mean “iconv -c -f latin1 -t utf8” instead?
@Nick
Hey there! The “iconv -c -f utf8 -t utf8” line is correct. What that’s doing is stripping invalid UTF-8 characters. Earlier versions of Cerb4 already wrote some multibyte content to the database, so that voodoo is trying to preserve it while converting the tables from latin1 to utf8. Doing it the more convenient way would convert a single 2-byte character (like an umlaut) into 2 characters taking 4 bytes (essentially re-encoding the existing UTF-8 bytes as if they were latin1). I know, it gets confusing. :)
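The double-conversion problem is easier to see with actual bytes. The following is a small Python sketch (not part of Cerb4 or the wiki instructions) that takes one umlaut already stored as UTF-8, misreads it as latin1, and converts it again. The variable names are just for illustration.

```python
original = "ü"

# The bytes an earlier Cerb4 version might already have stored: valid UTF-8.
utf8_bytes = original.encode("utf-8")         # 2 bytes: C3 BC

# A naive latin1 -> utf8 conversion first reads each byte as its own
# latin1 character, so one character becomes two.
misread = utf8_bytes.decode("latin-1")        # 2 characters

# ...and then encodes each of those characters as UTF-8 again.
double_encoded = misread.encode("utf-8")      # 4 bytes: C3 83 C2 BC

print(len(utf8_bytes), len(misread), len(double_encoded))

# Reversing the mistake recovers the original character.
recovered = double_encoded.decode("utf-8").encode("latin-1").decode("utf-8")
print(recovered == original)
```

Keeping the raw bytes untouched during the conversion avoids the second encoding step entirely, which is why the steps look more roundabout than a plain charset conversion.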
When you follow those steps, it should recover any special characters that were already in the database (in the address book, in message bodies). The last little ‘iconv’ bit is just making sure any raw bytes we kept during latin1->utf8 are valid UTF-8 characters. If not, they’re stripped. Since bytes are preserved, you can get unprintable character codes (mostly from spam) that cause problems when trying to import the .sql dump file again. Those bytes should never have been in the database in the first place.
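As a rough illustration of what that last ‘iconv’ pass does, here is a Python approximation. It assumes that dropping undecodable byte sequences is close enough to what “iconv -c” does for the purpose of the example; the two tools are not guaranteed to behave identically on every input. The function name `strip_invalid_utf8` and the sample bytes are made up.

```python
def strip_invalid_utf8(data: bytes) -> bytes:
    """Drop byte sequences that are not valid UTF-8 and keep everything else."""
    # errors="ignore" silently skips bytes that cannot be decoded as UTF-8.
    return data.decode("utf-8", errors="ignore").encode("utf-8")


# Hypothetical dump fragment: valid UTF-8 text with two stray bytes (FF FE)
# that can never appear in well-formed UTF-8.
dirty = b"caf\xc3\xa9 \xff\xfe ok"
clean = strip_invalid_utf8(dirty)

print(clean.decode("utf-8"))
```

The valid multibyte character survives and only the stray bytes disappear, which is the point of running utf8 to utf8 rather than latin1 to utf8.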
Hope that makes more sense! :)
The latest stable release tonight (Build 809) should finally correct the UTF-8 e-mail header issues. Enjoy!