Version 1 (modified by Dominic Hargreaves, 16 years ago)

Import from old wiki

This node is for collating information about character set support (especially UTF8) in OpenGuides and its supporting software. Content added here will be used in the OpenGuides distribution and thus released under the Perl licence.

The main issues we need to think about:

General and development plan

Initially we can get things going by implementing a charset config var which defaults to iso-8859-1. We can then make sure that the content-types are output correctly based on this.

CGI::Wiki has the 'charset' config directive. Just hook that. -- tom

Once we know how to tell databases to operate in utf8 mode (and other conditions are satisfied) we can have the install routines try and detect a suitable db, then set the default charset to utf8 in wiki.conf, initialize the database in an appropriate way and so on. Part of this logic should be in CGI::Wiki of course. If utf8 support cannot be enabled it should fall back to iso-8859-1 and warn the user.
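The fallback logic described above could be sketched roughly as follows. This is only an illustration: choose_charset and its db_supports_utf8 argument are hypothetical names, and the install-time probe that would supply that flag is not shown.

```perl
use strict;
use warnings;

# Hypothetical helper: pick the charset to write into wiki.conf.
# db_supports_utf8 is assumed to come from an install-time probe
# of the database, which is not shown here.
sub choose_charset {
    my (%args) = @_;
    my $perl_ok = $] >= 5.008;                 # perl 5.8 or newer
    my $db_ok   = $args{db_supports_utf8};

    return 'utf-8' if $perl_ok && $db_ok;

    # utf8 cannot be enabled: warn the user and use the old default.
    warn "UTF-8 support unavailable; falling back to iso-8859-1\n";
    return 'iso-8859-1';
}

my $charset = choose_charset( db_supports_utf8 => 1 );
print "charset = $charset\n";
```

Keeping the decision in one function means the installer and CGI::Wiki could share it, whichever of the two ends up owning the logic.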

Specifics are below.

Database support for charset

Not all databases support storing UTF8 data. Apparently MySQL 3 doesn't, but MySQL 4 does (character set support arrived in 4.1).

SQLite seems to mostly work with UTF8 if compiled with UTF8 support, though this is all a bit vague. I think the same is true for PostgreSQL, but I can't find any definitive statement about which versions.

(add more information about SQLite and PostgreSQL here)

I've found that DB support for charsets is too annoying, and you end up doing things the bad way for compatibility reasons anyway - I strongly suggest that you just store the utf8 byte-encoding of all strings in the database, and decode them again when they come out. Consider the database as a device that stores _bytes_ and knows nothing about characters, or you'll go mad. -- tom
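A minimal sketch of the "database stores bytes" approach, using the core Encode module from perl 5.8. The helper names to_db and from_db are made up for illustration; the actual database calls are left out.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helpers wrapping whatever database calls the
# application makes. Characters go in, bytes are stored.
sub to_db   { my ($chars) = @_; return encode('UTF-8', $chars); }
sub from_db { my ($bytes) = @_; return decode('UTF-8', $bytes); }

my $title  = "Caf\x{e9} \x{263a}";   # 6 characters
my $stored = to_db($title);          # 9 bytes of UTF-8
my $loaded = from_db($stored);       # 6 characters again

printf "chars=%d bytes=%d roundtrip=%s\n",
    length($title), length($stored),
    ($loaded eq $title ? 'ok' : 'broken');
```

The important point is that encoding and decoding happen in exactly one place each, at the boundary with the database.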

Content-Type output in meta tags, XML declarations, and HTTP headers

We need to make sure we can override the Content-Type set by Apache. CGI::Wiki::Plugin::RSS::ModWiki will need modifying as part of this.

Easy. We send the content type as 'text/html' at the moment. Set it to 'text/html; charset=utf-8'. -- tom
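As a rough sketch, the header and the matching meta tag could both be built from the configured charset. The function names here are hypothetical and do not correspond to anything in OpenGuides or CGI::Wiki.

```perl
use strict;
use warnings;

# Hypothetical helper: build the HTTP header from the configured
# charset, defaulting to iso-8859-1 when none is set.
sub content_type_header {
    my ($charset) = @_;
    $charset = 'iso-8859-1' unless defined $charset && length $charset;
    return "Content-Type: text/html; charset=$charset\r\n\r\n";
}

# Hypothetical helper: the matching <meta> tag, so the declaration
# inside the page agrees with the header.
sub content_type_meta {
    my ($charset) = @_;
    $charset = 'iso-8859-1' unless defined $charset && length $charset;
    return qq{<meta http-equiv="Content-Type" content="text/html; charset=$charset">};
}

print content_type_header('utf-8');
print content_type_meta('utf-8'), "\n";
```

Generating both from the same value avoids the case where the header and the page disagree about the encoding.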

Search/index compatibility

No idea about this yet, though I guess we can restrict our investigations to Plucene.

Again, utf-8 encode the thing you're searching for, and search for the bytes? Index the byte encodings of key words? --tom
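To make the byte-search idea concrete, here is a small sketch that does not involve Plucene at all: it encodes the query and looks for those bytes in already-encoded page content. bytes_contain is a hypothetical helper.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: true if the UTF-8 bytes of $query occur
# anywhere in $stored_bytes (which is already UTF-8 encoded).
sub bytes_contain {
    my ($stored_bytes, $query) = @_;
    my $needle = encode('UTF-8', $query);
    return index($stored_bytes, $needle) >= 0;
}

my $page = encode('UTF-8', "The Caf\x{e9} Royal, Regent Street");

print bytes_contain($page, "Caf\x{e9}") ? "match\n" : "no match\n";
```

Byte matching like this is exact: an unaccented "Cafe" will not find the accented word, and there is no case folding, so any normalisation would have to be applied to both the indexed text and the query before encoding.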

New enough perl

AFAIK we are best off mandating perl 5.8 for UTF8 support. Then things should be fairly transparent at the code level. We should probably keep the source code itself in iso-8859-1 and use escaped representations of characters for the time being, i.e. \x{nnn}. This will mean the code is still valid for perl 5.6.

Can someone who knows more about perl 5.6 support tell us whether this is an actual requirement, and what problems we are going to run into with perl 5.6?

Don't go there. Just don't. CGI::Wiki will do charset support on 5.8, and fall back to the old 'I'll just assume latin-1' method otherwise. utf8 in 5.6 is too painful to consider. -- tom
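The "charset support on 5.8, assume latin-1 otherwise" behaviour might look something like this sketch. decode_input is a hypothetical name, and this is not taken from the CGI::Wiki source.

```perl
use strict;
use warnings;

# Hypothetical helper: turn raw request bytes into a Perl string.
sub decode_input {
    my ($bytes, $charset) = @_;
    if ($] >= 5.008) {
        require Encode;                    # core module since perl 5.8
        return Encode::decode($charset, $bytes);
    }
    # Older perls: leave the bytes alone, i.e. assume latin-1 as before.
    return $bytes;
}

my $string = decode_input("\xc3\xa9", 'UTF-8');   # UTF-8 bytes for e-acute
printf "length=%d\n", length $string;
```

Loading Encode with require inside the version check keeps the file compilable on a perl that does not ship the module.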

Reference for unicode support in perl 5.8 is at

I highly recommend Mark Fowler's UTF8 talk - -- tom


Requires Template Toolkit 2.14, apparently released in October.

mailing list link describes hacks for pre-2.14 installs but we probably don't want to go there.

TT 2.14 allows you to set a binmode on the output filehandle. You can work around the lack of this by outputting to a scalar and encoding that to bytes beforehand, but it's icky. The thing that TT still doesn't allow is non-Latin-1 in hash keys and things - it's a limitation of the XS stash. It's not too crippling, though. --tom
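For reference, the scalar workaround described above might be sketched like this. render_to_bytes is a hypothetical helper, and the code assumes Template Toolkit is installed from CPAN.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: render a template held in a string and
# return the result as UTF-8 bytes, ready to print.
sub render_to_bytes {
    my ($template_text, $vars) = @_;
    require Template;                      # Template Toolkit, from CPAN
    my $tt  = Template->new;
    my $out = '';
    # Passing a scalar ref as the third argument collects the output
    # in $out instead of printing it.
    $tt->process(\$template_text, $vars, \$out)
        or die $tt->error;
    return encode('UTF-8', $out);
}

# With TT 2.14 or later, the binmode option to process() does the
# encoding on the way out to a file or filehandle instead:
#   $tt->process($file, $vars, $outfile, binmode => ':utf8');
```

A call such as render_to_bytes("Hello [% name %]", { name => "\x{263a}" }) returns bytes that can be printed without a "Wide character" warning.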