Recently, I needed to build Web applications for multiple languages. Internationalization - i18n for short (there are 18 letters between the "i" and the "n" in "internationalization") - is the process of building Web applications for multiple languages and/or locales. The W3C defines internationalization as "proposing and coordinating any techniques, conventions, guidelines and activities within the W3C and together with other organizations that allow and make it easy to use W3C technology worldwide, with different languages, scripts, and cultures."
While this definition is good and comprehensive, I only needed to focus on one specific level of internationalization. This article describes the application I was working with, the pitfalls I ran into, and the solutions I found to make my application work.
I've been working for a small learning firm here in Nashville for a couple of months. We recently started an international Web-based learning project designed to teach Chinese students the English language and English-speaking students the Chinese language. There is more to the project than this, but suffice it to say this project will be the largest e-Learning venture ever attempted. The courseware used to teach both Chinese and English students is built almost entirely in FlashMX. Without going into a lot of detail on how the courseware is built, it might be helpful to know that the Flash pieces pull in dynamic textual content from XML files. It's these XML files that hold the various "translations" of English and Chinese content. To support the storage, editing, organizing, and deploying of the multilingual XML files, I employed an application I had previously written using ColdFusionMX, Microsoft Access, and of course HTML. There was one large difference between the application I had written and the project I was going to use it for: the application worked with English and Spanish, and I needed it to work with English and simplified (GB) Chinese. My new charge was to get it functioning with Chinese.
TRIAL AND ERROR
Getting the application working with Chinese, which is not a single-byte language, became a process of trial and error across many different scenarios. Chinese characters are multibyte, and the number of bytes needed for each character depends on the character and on the encoding used. Since my application needed to support both single-byte and multibyte character sets, encoding the translations as Latin-1 was out of the question. Unicode, or more specifically UTF-8, was to become my friend. It was here that the problems began.
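To make the single-byte versus multibyte point concrete, here is a small Python sketch (Python is not part of the original application; the sample characters and the helper name `byte_lengths` are my own illustration). It reports how many bytes one character needs in Latin-1, UTF-8, and GB2312, and marks an encoding as `None` when it cannot represent the character at all.

```python
# Illustrative only: compare the encoded size of a character in three encodings.
def byte_lengths(ch):
    """Return {codec: byte count}, or None where the codec cannot encode ch."""
    sizes = {}
    for codec in ("latin-1", "utf-8", "gb2312"):
        try:
            sizes[codec] = len(ch.encode(codec))
        except UnicodeEncodeError:
            # This codec has no representation for the character.
            sizes[codec] = None
    return sizes


if __name__ == "__main__":
    for ch in ("A", "中"):
        print(ch, byte_lengths(ch))
```

For the Chinese character, Latin-1 has no representation at all, while UTF-8 and GB2312 both can store it but use a different number of bytes.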
ATTEMPT NUMBER ONE - MS ACCESS
The application itself - that is, the front end - was not really the problem. It's easy to set the character set encoding on an HTML page so that various languages display appropriately in a browser (this assumes, of course, that the client's browser has the appropriate language support installed on their machine). This is accomplished with the HTML META tag. Page encoding can also be modified by the application server, in this case ColdFusionMX, using the CFCONTENT tag. By default ColdFusionMX returns character data in the Unicode UTF-8 format, but if you need to support a different character set you can specify it with the CFCONTENT tag. So what was the problem I ran into using MS Access? Ironically, the problem is not with Access, and blame can't really be laid on CFMX either. However, the problem does stem from a part of ColdFusionMX, specifically the JDBC drivers that CFMX uses to connect to Access databases. The drivers do not support Unicode data, so even though your HTML forms and your subsequent ColdFusion queries are sending UTF-8 FORM data to the database, the database does not receive the data correctly. This little problem is on Macromedia's to-do list for the next release of ColdFusion, which I can't comment on.
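The two failure modes described above can be simulated in a few lines of Python. This is only a sketch of the general problem, not a reproduction of what the CFMX JDBC driver actually does internally. The function names and the sample text are hypothetical, and Latin-1 stands in for "some single-byte code page".

```python
# Illustrative simulation; not the real driver behaviour.
def through_non_unicode_layer(text, codepage="latin-1"):
    """Pass text through a layer that can only carry one single-byte code page.

    Characters outside the code page are replaced with '?'."""
    return text.encode(codepage, errors="replace").decode(codepage)


def misread_utf8(text, wrong_charset="latin-1"):
    """Interpret the UTF-8 bytes of text as if they were a single-byte charset."""
    return text.encode("utf-8").decode(wrong_charset)


if __name__ == "__main__":
    sample = "学习中文"  # sample Chinese text, chosen for illustration
    print(through_non_unicode_layer(sample))  # characters are lost for good
    print(repr(misread_utf8(sample)))         # garbled, but the bytes survive
```

The difference matters when debugging: mis-declared page encoding produces garbled text that can be recovered by decoding the bytes correctly, whereas a layer that cannot carry the characters destroys them.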
ATTEMPT NUMBER TWO - MYSQL
On to another database. We already had a machine set up with mySQL and I have worked with mySQL on several occasions (this Web site runs on mySQL), so I started to investigate the possibility of running the application on the latest release of mySQL. What I found was that mySQL's implementation of multiple character-set support is relatively poor. The database does claim support for certain character sets, but Chinese was not among them. Additionally, the database does not fully support Unicode data. Future releases of mySQL (see Character set handling, with full Unicode support, http://www.mysql.com/products/mysql/index.html for more information) are slated to support the full implementation of Unicode data in the UCS-2 and UTF-8 encodings. I did perform a number of tests on a sample mySQL database using UTF-8, and none of the tests worked.
THE GOLDEN PARACHUTE
Just like CEOs who have a "golden parachute" for their business ventures, SQL Server became mine. Built into SQL Server 2000 are Unicode-specific datatypes. These include ntext, nchar, and nvarchar. For more on these datatypes, see this Web site. After porting the database from MS Access to SQL Server and ensuring the encoding on the HTML pages was UTF-8, I was good to go, right? Almost. For some reason, when inserting data into a column specified with a Unicode datatype, the Chinese characters were lost. For a moment I thought I was dealing with a CFMX database driver problem again. As it turns out, there is an odd behaviour between CFMX and SQL Server 2000 regarding the Unicode datatypes. Unless you place a literal "N" in front of all Unicode column data in your CFQUERY inserts and updates, the characters are lost. Macromedia released a technote regarding this behaviour but doesn't offer any real reason why the extra "N" is necessary. The likely explanation lies in SQL Server itself: a string literal without the "N" prefix is treated as non-Unicode data in the database's code page, so any character outside that code page is converted before it ever reaches the Unicode column. Thus, if you plan on using the built-in Unicode datatypes that SQL Server 2000 provides with CFMX, you'll need to pay special attention to this small detail.
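As a rough illustration of the "N" prefix, the following Python sketch builds an INSERT statement whose string value is written as a Unicode literal. It is not the CFQUERY code from my application. The helper names and the table and column names (`translations`, `content`) are hypothetical, and the table and column names are assumed to be trusted values.

```python
# Illustrative only: shows the shape of an N'...' literal in the generated SQL.
def tsql_unicode_literal(value):
    """Quote value as a Unicode string literal, N'...'.

    Single quotes inside the value are doubled, which is how a quote is
    escaped inside a SQL string literal."""
    return "N'" + value.replace("'", "''") + "'"


def build_insert(table, column, value):
    """Build an INSERT statement for one Unicode column (names assumed trusted)."""
    return f"INSERT INTO {table} ({column}) VALUES ({tsql_unicode_literal(value)})"


if __name__ == "__main__":
    print(build_insert("translations", "content", "你好"))
```

Building SQL by string concatenation is shown here only to make the prefix visible. Where the database library offers parameter binding, that is usually the safer choice.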
After correcting all my queries (inserts and updates), my application was working. Chinese data was displaying correctly in my browser and HTML forms, and was making it to the database as well. Even the exported XML worked correctly without a programming change. Armed with newfound internationalization knowledge, I feel much more comfortable creating applications for multiple languages.
About this post:
This entry was posted by Aaron West on June 3, 2003 at 4:22 PM. It was filed in the following categories: Programming.