User User name Password  
   
Sunday 8.11.2009 / 01:25 AM
Search AfterDawn.com:        In English   Suomeksi   På svenska
afterdawn.com / profiles / personal safety valve / Blog archive /
Home Blog Pictures Shoutbox Links

January 2009

Two days of UTF-8 happiness

(23 Jan 2009 6:16)

Had "one of those days" .. actually two of them. With the joys of trying to make sense to the swamp called character encoding.

Now, to explain the situation, you must understand the difference between ISO-8859-1 (and its sister, ISO-8859-15.. both of which are typically/roughly equivalent to the database-level encoding of 'latin1') and UTF-8 / Unicode. Latin1 basically is a character set that is built for Western languages and is limited to 255 characters, but can be used to store and represent pretty much all of the Western European languages correctly (English, Finnish, Swedish, German, French, etc).

As our site is translated to three languages -- English, Finnish and Swedish -- we obviously can survive with latin1/iso-8859-15 pretty nicely, and have done so for almost a decade now. But as we're about to expand to new language areas, the wonderful world of fancy character codes has been forced upon us. Namely, Russian and Cantonese.

The most straight-forward way to make the site usable also in non-Western languages would be to just convert the whole site to use UTF-8 -- convert the MySQL, its tables and all of its columns to use utf8 and also to switch the application server (ColdFusion clone, BlueDragon, that we use on our site) to use UTF-8 internally and to change all the HTML page headers to set the encoding to UTF-8. Now, as most of our small dev team is tied to our upcoming "v4" site version, such project is not feasible, as it would need too much testing, etc to get it done properly and it would also postpone the launch and development of the v4. And our only goal for now is really to allow volunteer translators to be able to enter translations of our site's elements to our database -- until that (and v4) is done, there's no point to rush the "lets convert the whole site to UTF8" project.

Ok, so, I basically have just two templates that need to use UTF-8 and then pass it to our database in correct format. But as our application server's default encoding is set to ISO-8859-15 and our database server is using latin1, this needed some tweaking.

First of all, we have a database, called 'software' that then contains a table called 'software_information'. Both, the server and the table, are by default, set to latin1. Same goes for our JDBC database connection from app server to MySQL.

Now, I added a new column to the 'software_information' table called 'long_description_russian' and set that column -- and that column only -- to be in UTF-8 encoding. Basically using HeidiSQL to add the column and then manually making sure it is in utf8:

ALTER TABLE software_information MODIFY long_description_russian TEXT CHARACTER SET utf8;

..alright. Now, the templates that allow translators to enter data, need obviously to tell the browser that they're presented in UTF-8. That's a non-issue, really, just force the app server to process everything in UTF-8 by adding this data to both of the templates:

<cfprocessingdirective pageencoding="utf-8" />
<cfcontent type="text/html; charset=utf-8">

First line tells BlueDragon/ColdFusion to process all the data internally in UTF-8 on that particular template and the second one sends a header to the user's browser to make sure that the page is encoded on the user's browser as a UTF-8 page.

Tried that. Entered Russian text to the form and submitted it. The "action page" that took the form data, showed gibberish instead of Russian. Damn. More googling. Apparently adding this to your <form> should help:

accept-charset="UTF-8"

thus making the <form> to look something like this:

<form action="addTranslation.cfm" method="post" accept-charset="UTF-8">

Testing.. Still gibberish. More googling and couple of cigarettes later, found out that I need to add also these two lines of code to both templates:

<cfset setEncoding("form","utf-8")>
<cfset setEncoding("url","utf-8")>

Testing again.. Yay! Works, the data is passed in correct form from the input template to the action template. Now, added a basic SQL statement to insert that column (and that column only) to the database to the correct place:

UPDATE software_information
SET long_description_russian='#userSubmittedTranslation#'
where softwareID=#softwareID#

...testing that. No errors. But looking at the data back from the database -- gibberish again. Damn. Couple of tests and couple of hours and many cigarettes later, having tried pretty much everything, ranging from storing the data in binary format to various other tweaks, no success. Then decided to look at our JDBC connection string.. It stated characterEncoding=latin1 ... just a guess, but we probably will mess up the whole site by touching to that. Thus, decided to create a copy of that db connection, called 'software2' with same arguments, connecting to the same 'software' database we have on MySQL server as previously, but changing the connection string to use utf8 encoding instead.

Testing... Still gibberish.. More googling. Found out that adding couple of instructions for the MySQL before my query could help, but as BlueDragon freaks out when queueing queries with semicolon (;) had to hack it a bit. So, the update statement changed to this:

<cftransaction>
<cfquery name="setToUTF8" datasource="software2">
set names utf8
</cfquery>
<cfquery name="addTranslation" datasource="software2">
UPDATE software_information
SET long_description_russian='#userSubmittedTranslation#'
where softwareID=#softwareID#
</cfquery>
</cftransaction>

Yay! Works, data goes to the db correctly. Now, how to get it out properly... Easy, just add the 'set names utf8' before your select statement too -- and make sure that the select statement goes to the 'software2' table again, using the UTF8 connection string rather than latin1 connection string.

Now, this is an ugly hack for something that is completely temporary, but works nicely until we have spare time to actually go and do the inevitable, convert our whole site to use UTF-8.

[ Post comments ]

 

Digital video: AfterDawn.com | AfterDawn Forums
Music: MP3Lizard.com
Gaming: Blasteroids.com | Blasteroids Forums | Compare game prices
Software: Software downloads
Blogs: User profile pages
RSS feeds: AfterDawn.com News | Software updates | AfterDawn Forums
International: AfterDawn in Finnish | AfterDawn in Swedish | download.fi
Navigate: Search | Site map
About us: About AfterDawn Ltd | Advertise on our sites | Rules, Restrictions, Legal disclaimer & Privacy policy
Contact us: Send feedback | Contact our media sales team
 
  © 1999-2009 by AfterDawn Ltd.