Changes between Initial Version and Version 1 of CharacterSets


Timestamp: Nov 19, 2005, 11:00:42 AM
Author: Dominic Hargreaves
Comment: Import from old wiki

This node collates information on character set support, especially UTF-8, in OpenGuides and its supporting software. Content added here will be used in the OpenGuides distribution and is therefore released under the Perl licence.

The main issues we need to think about:

=== General and development plan ===

Initially we can get things going by implementing a charset config variable which defaults to iso-8859-1. We can then make sure that the content types are output correctly based on it.

CGI::Wiki has the 'charset' config directive. Just hook that. -- tom

Once we know how to tell databases to operate in UTF-8 mode (and other conditions are satisfied), the install routines can try to detect a suitable database, then set the default charset to utf8 in wiki.conf, initialize the database in an appropriate way, and so on. Part of this logic should of course live in CGI::Wiki. If UTF-8 support cannot be enabled, the installer should fall back to iso-8859-1 and warn the user.
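The fallback described above could be sketched roughly as follows. This is only an illustration: the helper name choose_charset, the check names perl_ok and db_ok, and the exact charset strings are assumptions, not part of OpenGuides or CGI::Wiki.

```perl
use strict;
use warnings;

# Hypothetical helper: decide which charset the installer writes to wiki.conf.
# $checks is an assumed hash ref of named checks, e.g. { perl_ok => 1, db_ok => 0 }.
sub choose_charset {
    my ($checks) = @_;
    my @failed = grep { !$checks->{$_} } sort keys %$checks;
    return ('utf-8', undef) unless @failed;
    return ('iso-8859-1', "UTF-8 support disabled; failed checks: @failed");
}

# Example run: the database check fails, so we fall back and warn.
my ($charset, $warning) = choose_charset({ perl_ok => 1, db_ok => 0 });
warn "$warning\n" if defined $warning;
print "charset = $charset\n";
```

Keeping the decision in one function would make it easy to move into CGI::Wiki later, as suggested above.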

Specifics are below.

=== Database support for charset ===

Not all databases support storing UTF-8 data. Apparently MySQL 3 doesn't, but [http://dev.mysql.com/doc/mysql/en/Charset-Unicode.html MySQL 4] does (Unicode support arrived in 4.1).

SQLite seems to mostly work with UTF-8 if compiled with UTF-8 support. This is all a bit vague though. I think the same is true for [http://www.sai.msu.su/~megera/postgres/utf8.html PostgreSQL], but I can't find any definitive statements about which versions.

(add more information about SQLite and PostgreSQL here)

I've found that DB support for charsets is too annoying, and you end up doing things the bad way for compatibility reasons anyway - I strongly suggest that you just store the utf8 byte-encoding of all strings in the database, and decode them again when they come out. Consider the database as a device that stores _bytes_ and knows nothing about characters, or you'll go mad. -- tom
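A minimal sketch of the "database only stores bytes" approach, using the core Encode module (perl 5.8). The %fake_db hash and the store_text/fetch_text helpers are stand-ins invented for illustration; a real version would sit around the DBI calls.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# A plain hash stands in for the database: it only ever sees bytes.
my %fake_db;

sub store_text {
    my ($key, $text) = @_;
    $fake_db{$key} = encode('UTF-8', $text);    # characters -> bytes
    return;
}

sub fetch_text {
    my ($key) = @_;
    return decode('UTF-8', $fake_db{$key});     # bytes -> characters
}

store_text('cafe', "Caf\x{e9} \x{263A}");
my $text = fetch_text('cafe');
```

The stored value is longer in bytes than the original string is in characters, which is exactly why the database should never be asked to count or compare characters itself.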

=== Content-Type output in meta tags, XML declarations, and HTTP headers ===

We need to make sure we can override the Content-Type set by Apache.
CGI::Wiki::Plugin::RSS::ModWiki will need modifying as part of this.

Easy. We send the content type as 'text/html' at the moment. Set it to 'text/html; charset=utf-8'. -- tom
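As a rough sketch, the header and the meta tag can both be built from the configured charset so they never disagree. The content_type helper is hypothetical, and the iso-8859-1 default simply mirrors the plan above.

```perl
use strict;
use warnings;

# Hypothetical helper: build the Content-Type value from the configured charset.
sub content_type {
    my ($charset) = @_;
    $charset = 'iso-8859-1' unless defined $charset && length $charset;
    return "text/html; charset=$charset";
}

my $type = content_type('utf-8');

# The same value goes into the HTTP header and the HTML meta tag.
print "Content-Type: $type\r\n\r\n";
print qq{<meta http-equiv="Content-Type" content="$type">\n};
```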

=== Search/index compatibility ===

No idea about this yet, though I guess we can restrict our investigations to Plucene.

Again, utf-8 encode the thing you're searching for, and search for the bytes? Index the byte encodings of key words? -- tom
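A toy illustration of the byte-search idea, not tied to Plucene in any way. The %docs hash and the search_bytes helper are made up for the example; the only assumption is that stored text is already UTF-8 bytes.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Assumed: stored documents are already UTF-8 byte strings.
my %docs = (
    cafe => encode('UTF-8', "Caf\x{e9} Nero, Soho"),
    pub  => encode('UTF-8', "The Red Lion, Soho"),
);

# Encode the query the same way, then compare bytes with bytes.
sub search_bytes {
    my ($query) = @_;
    my $needle  = encode('UTF-8', $query);
    my @hits    = grep { index($docs{$_}, $needle) >= 0 } keys %docs;
    return sort @hits;
}

my @hits = search_bytes("Caf\x{e9}");
```

Note that this kind of matching is exact at the byte level, so case-insensitive search on non-ASCII text would need extra work.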

=== New enough perl ===

AFAIK we are best off mandating perl 5.8 for UTF-8 support. Then things should be fairly transparent at the code level. We should probably leave the code itself as iso-8859-1 and use escaped representations of characters for the time being, i.e. \x{nnn}. This will mean the code will still be valid for perl 5.6.
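For example, a non-ASCII character can be written as an escape so the source file itself stays plain ASCII. The variable name and sample string below are arbitrary.

```perl
use strict;
use warnings;

# The source file stays pure ASCII; the e-acute is written as an escape.
my $name = "Caf\x{e9}";    # U+00E9

printf "%d characters, last code point U+%04X\n",
    length($name), ord(substr($name, -1));
```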

Can someone who knows more about perl 5.6 support tell us whether this is an actual requirement, and what problems we are going to run into with perl 5.6?

Don't go there. Just don't. CGI::Wiki will do charset support on 5.8, and fall back to the old 'I'll just assume latin-1' method otherwise. utf8 in 5.6 is too painful to consider. -- tom
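One way such a version switch might look, purely as a sketch: the variable names are invented and this is not taken from CGI::Wiki's code. It relies only on the built-in $] version variable and on Encode being bundled with perl 5.8.

```perl
use strict;
use warnings;

# Hypothetical check: only offer charset support on perl 5.8 or later.
my $has_unicode = ($] >= 5.008) ? 1 : 0;
my $charset     = $has_unicode ? 'utf-8' : 'iso-8859-1';

if ($has_unicode) {
    require Encode;    # bundled with perl 5.8
}

print "perl $] -> charset $charset\n";
```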

The reference for Unicode support in perl 5.8 is at [http://www.perldoc.com/perl5.8.4/pod/perlunicode.html perldoc.com].

I highly recommend Mark Fowler's UTF8 talk - http://www.twoshortplanks.com/talks/yapce2004/perlandutf8.ppt -- tom

=== Templates ===

Requires Template Toolkit 2.14, apparently released in October.

A [http://www.template-toolkit.org/pipermail/templates/2003-November/thread.html#5322 mailing list thread] describes hacks for pre-2.14 installs, but we probably don't want to go there.

TT 2.14 allows you to set a binmode on the output filehandle. You can work around the lack of this by outputting to a scalar and encoding that to bytes beforehand, but it's icky. The thing that TT still doesn't allow is non-Latin-1 in hash keys and things - it's a limitation of the XS stash. It's not too crippling, though. -- tom