Monday, February 19, 2007

One Encoding to Rule Them All

In Java, all strings are Unicode strings. The Swing GUI components also support more advanced features of Unicode, such as Arabic right-to-left writing. It is almost trivial to make a GUI application that supports all possible scripts, from Katakana, Cyrillic alphabets and Korean Hangul to thousands of Chinese characters and various right-to-left scripts. The only thing you need to take care of is that whenever you access external interfaces, the data is passed in Unicode.

In traditional GUI applications, this means reading and writing text files in Unicode instead of the default encoding. Here's how you open a file for writing in UTF-8 encoding:

File file = new File( fileName );
FileOutputStream fos = new FileOutputStream( file );
OutputStreamWriter osw = new OutputStreamWriter( fos, "UTF-8" );
BufferedWriter writer = new BufferedWriter( osw );

The difference from opening a file in the default encoding is really small:

File file = new File( fileName );
FileWriter fw = new FileWriter( file );
BufferedWriter writer = new BufferedWriter( fw );
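Reading follows the same pattern: FileInputStream and InputStreamReader take the place of FileReader. The following is only a minimal sketch of a write-then-read round trip; the class name Utf8FileDemo and its helper methods are made up for illustration, and the sample text is written with Unicode escapes so that the source file itself stays plain ASCII.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;

// Hypothetical helper class; the names are illustrative only.
class Utf8FileDemo {

    // Writes text as UTF-8, using the same stream chain as the writer snippet above.
    static void writeUtf8(File file, String text) throws IOException {
        BufferedWriter writer = new BufferedWriter(
                new OutputStreamWriter(new FileOutputStream(file), "UTF-8"));
        try {
            writer.write(text);
        } finally {
            writer.close();
        }
    }

    // Reads the whole file as UTF-8. FileInputStream + InputStreamReader
    // replace the FileReader you would use for the default encoding.
    static String readUtf8(File file) throws IOException {
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(file), "UTF-8"));
        try {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
            return sb.toString();
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws IOException {
        File file = File.createTempFile("utf8-demo", ".txt");
        file.deleteOnExit();
        // "Katakana" in Katakana, followed by "kirillitsa" in Cyrillic.
        String text = "\u30AB\u30BF\u30AB\u30CA "
                + "\u043A\u0438\u0440\u0438\u043B\u043B\u0438\u0446\u0430";
        writeUtf8(file, text);
        System.out.println(text.equals(readUtf8(file))); // prints true
    }
}
```

Passing the encoding name explicitly on both sides is the whole point: the result no longer depends on the platform's default encoding.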

The difference in reading a file is equally small. Not too difficult, huh? Turns out that it is for those who haven't confronted foreign character sets before. Before starting to roll my own flashcard program, I tried some free ones, some of them written in Java. It turned out that many of them didn't support Unicode. Well, that's one way of making sure that communists don't use your program.

[Image: don't support Unicode]

You may also want to write unit tests to ensure that the Unicode support really works. To embed Unicode characters in the source files of the test cases, you need to tell the compiler that the source files are encoded in UTF-8.

C:\>javac -encoding UTF8
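To see what the switch protects against, here is a small sketch of what happens when UTF-8 bytes are decoded with the wrong charset, which is roughly what a compiler or editor does when its encoding setting doesn't match the file. The class SourceEncodingDemo and the method misread are hypothetical names, and ISO-8859-1 is just an assumed example of a mismatched encoding.

```java
import java.io.UnsupportedEncodingException;

// Hypothetical demo class; the names are illustrative only.
class SourceEncodingDemo {

    // Encodes the text as UTF-8 and then decodes those bytes with another
    // charset, the way a tool with the wrong encoding setting would.
    static String misread(String text, String wrongCharset)
            throws UnsupportedEncodingException {
        byte[] utf8Bytes = text.getBytes("UTF-8");
        return new String(utf8Bytes, wrongCharset);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // LATIN SMALL LETTER A WITH DIAERESIS, written as an escape so that
        // this file compiles the same way under any -encoding setting.
        String original = "\u00E4";
        String garbled = misread(original, "ISO-8859-1");

        System.out.println(garbled.length());                 // prints 2
        System.out.println(garbled.equals("\u00C3\u00A4"));   // prints true
    }
}
```

One character turns into two, and a test comparing against the intended string would fail even though the program under test is fine.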

Ditto with the editor (in this case, Eclipse):

[Image: Eclipse and Unicode]

With web services, the issue is a bit different, since there are so many external interfaces. First of all, the UTF-8 source file switch must be added to the Ant build file (assuming that you use Ant):

<javac destdir="${build.dir}"
       encoding="UTF-8"
       ...>

Now, let's look at the generated XHTML page. First, to make it valid XML, you have to declare the encoding in the XML declaration on the first line, the one that identifies the document as XML.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="FI" lang="FI">
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>

The second declaration, the <meta> element, deserves more explanation. When a web page is delivered from a server to a browser, it is delivered over HTTP. The protocol includes HTTP headers that describe the payload. Usually, the default values of the headers are just fine and you can forget them, except for the character encoding. The <meta> element is trying to say that the document IS encoded in UTF-8, no matter what the HTTP headers say.

Unfortunately, browsers tend to believe the HTTP headers rather than the <meta> tag. This code snippet is from the Perl exercise where I first ran into this problem:

# The final part of this line (charset=UTF-8) is
# absolutely essential. Without that line,
# the UTA webserver somehow convinces the
# browser that the encoding isn't UTF8,
# even if the .html says otherwise.
print "Content-type: text/html; charset=UTF-8\n\n";

In servlets, the equivalent code is:

public void doPost(
        HttpServletRequest request,
        HttpServletResponse response)
        throws ServletException, IOException {
    // Set the content type, charset included, before calling getWriter().
    response.setContentType("text/html; charset=UTF-8");

    PrintWriter out = response.getWriter();
    out.println( "<html>" );
    out.println( "<head>" );
    out.println( "<title>Hello World</title>" );
    out.println( "</head>" );
    out.println( "<body><h1>Hello World</h1></body>" );
    out.println( "</html>" );
}

In addition to writing data in UTF-8, we also need to read the characters typed by the user. This is done by setting the encoding property of the request object (the object that contains the form data from the user):

request.setCharacterEncoding( "UTF-8" );
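Why this matters can be sketched without a servlet container. A browser sends non-ASCII form data as percent-encoded bytes, and the server has to pick a charset to turn those bytes back into characters. The example below uses java.net.URLDecoder purely as a stand-in for that step; the class name FormDecodingDemo is made up, and the assumption is that the container falls back to something like ISO-8859-1 when no encoding has been set.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical demo class; URLDecoder stands in for the container's
// own form-parameter parsing.
class FormDecodingDemo {

    // Decodes a percent-encoded form value using the given charset.
    static String decode(String rawValue, String charset)
            throws UnsupportedEncodingException {
        return URLDecoder.decode(rawValue, charset);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // KATAKANA LETTER KA (U+30AB) as a browser would send it in UTF-8.
        String raw = "%E3%82%AB";

        // With the right charset, the three bytes become one character.
        System.out.println(decode(raw, "UTF-8").equals("\u30AB"));  // prints true

        // With the wrong charset, each byte becomes its own character.
        System.out.println(decode(raw, "ISO-8859-1").length());     // prints 3
    }
}
```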

The Java documentation says that the encoding of the request object must be set before you read any of the data that the user just typed. Sometimes the servlet is provided by a third party, and you can't set the encoding just in time before the data is read. For example, if you use Spring's SimpleFormController to validate forms, Spring automatically reads the form data into a more convenient structure before handing it to you. In these cases, you have to configure a filter that runs before the servlet. Filters are classes that modify the input or the output of a servlet. The filter is declared in the deployment descriptor, web.xml.
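A typical declaration might look like the sketch below. It assumes that Spring's own CharacterEncodingFilter is used, which is a reasonable but unconfirmed guess given that Spring is already on the classpath; any filter that calls request.setCharacterEncoding would do the same job. The filter name encodingFilter is arbitrary.

```xml
<!-- Assumption: Spring's CharacterEncodingFilter; the filter name is arbitrary. -->
<filter>
  <filter-name>encodingFilter</filter-name>
  <filter-class>org.springframework.web.filter.CharacterEncodingFilter</filter-class>
  <init-param>
    <param-name>encoding</param-name>
    <param-value>UTF-8</param-value>
  </init-param>
</filter>

<!-- Run the filter for every request, before any servlet reads the form data. -->
<filter-mapping>
  <filter-name>encodingFilter</filter-name>
  <url-pattern>/*</url-pattern>
</filter-mapping>
```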



Finally, if you use JSP to generate the final pages, you need to add the following directives to the JSP files:

<%@ page contentType="text/html; charset=UTF-8" %>
<%@ page pageEncoding="UTF-8" %>

Now, after shouting 11 times "USE THE DAMN UNICODE!!!" the system finally believes. We're still waiting for the one encoding switch to rule them all.

The encodings not so blest as thee,
Shall in their turns to limitations fall;
While thou shalt flourish great and free,
The dread and envy of them all.

Rule, Unicode! Unicode, rule the scripts:
Britons never shall need more glyphs.
