How the ACG Went Multilingual

this is a developing document

Introduction

At the end of 2001 I took some time to tinker with the language negotiation feature of HTTP. If you've ever stumbled into the "Language" option in your web browser and wondered what it was for, this essay is in part about that. You can set your preferred language in the browser, the browser communicates this preference to a web site, and if a translation is available you get it automatically. This article gives an overview of the steps I went through to get the feature working for the Abyssinia Cyberspace Gateway's Ethiopia page. The basic process is very simple and should be applicable almost anywhere.

Server Setup

For any of this to work, the web server that dishes up your web site must first be configured to look for Amharic or other translations of your web pages. I've worked only with the Apache web server, which does not recognize Ethiopian languages by default. I expect this is the case for most other widely used web servers. You may test the setup at your web hosting company by following the steps in the next section.

Unless you manage the server yourself you will have to email or otherwise contact the web support staff at your hosting company to request that your language of choice be added if unavailable. An example request letter follows:

    Dear Tech Support,

    Please add the following language options to the "conf/httpd.conf" file in the section shown. Support for these languages is a publishing need of my web site and likely that of other customers as well:

        <IfModule mod_mime.c>
          :
          :
          # Amharic in iso639
          AddLanguage am .am
          # Tigrinya in iso639
          AddLanguage ti .ti
          :
          :
        </IfModule>

    I have confirmed that the languages are not already there, or that "mod_mime" is not being used, in which case please also activate it.

    Thank You,
    <your name here>

The "mod_mime" module must of course also be in use by the server (it is by default in Apache). The lines map the letters "am", which a browser sends for the "Amharic" language preference, to the file extension ".am", and likewise "ti" for Tigrinya. The file extension could have been ".amharic" or anything else, but sticking with the same terms that web browsers must use (ISO 639-1 language codes) is the norm and should make the web support staff less reluctant to add strange new languages ;-) The changes do not go into effect until the web server is restarted.

Language Targeted Files

The rest is up to you. You will need to rename your Amharic web page so that the file name ends with ".am". For example:

Old Name        New Name
file.htm        file.htm.am
index.htm       index.htm.am
index.html      index.html.am
default.asp     default.asp.am

Likewise your English version would be renamed with ".en" and so on. Now when a person clicks on a link for "file.html", Apache will look for it. If "file.html" is really there it is used as-is and no language checking occurs. When "file.html" is not there, Apache will search the directory for a version in the web visitor's preferred language. A person may have any number of preferred languages set (a first choice, second choice, etc.); the first language in that list for which a file is found is taken and sent back to the browser. If none match, the default language will likely be the ".en" version if that is the norm at your web hosting company.
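The lookup just described can be sketched in a few lines of Python. This is only a simplified model of the behaviour as described in this essay, not Apache's actual negotiation algorithm; the function names and the `default_lang` parameter are hypothetical, and regional tags such as "en-US" are not matched against ".en" here.

```python
def parse_accept_language(header):
    """Return language codes from an Accept-Language value, best first."""
    prefs = []
    for position, item in enumerate(header.split(",")):
        item = item.strip()
        if not item:
            continue
        code, _, params = item.partition(";")
        quality = 1.0  # a language without a q-value has full weight
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0
        # Sort by quality (high first), then by position in the header.
        prefs.append((-quality, position, code.strip().lower()))
    return [code for _, _, code in sorted(prefs)]


def choose_file(requested, available, accept_language, default_lang="en"):
    """Pick the file to serve, following the rules described above."""
    # An exact match is used as-is; no language checking occurs.
    if requested in available:
        return requested
    # Otherwise take the first preferred language that has a translation.
    for code in parse_accept_language(accept_language):
        candidate = f"{requested}.{code}"
        if candidate in available:
            return candidate
    # Fall back to the (assumed) server default, if that version exists.
    fallback = f"{requested}.{default_lang}"
    return fallback if fallback in available else None
```

For example, with "index.html.am" and "index.html.en" in the directory, a visitor whose browser sends "am, en;q=0.5" would be given "index.html.am", while a visitor sending only "ti" would fall through to the English default.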

The default language a web server uses can also be changed in the web server configuration file. Changing the default server language would impact every web site in the scope of the server (this might be in the hundreds), and you wouldn't want to change it to anything other than the primary language of the majority of the web sites hosted. The Ethiopia page, for instance, defaults to English, as would every page accessed under the server. The default would not be changed to Amharic unless I were assured that Amharic was the primary language of the visitors coming to the page and that the majority of my Amharic speakers also had a Unicode font installed so that they could read the page (quite another battle).

A visitor will now get the right page, as per their browser configuration, when accessing:

http://www.abyssiniagateway.net/ethiopia/

It is a good idea to make links to alternative translations from each version. The other translations may be accessed directly using the fully qualified file name:

Amharic: http://www.abyssiniagateway.net/ethiopia/index.html.am
English: http://www.abyssiniagateway.net/ethiopia/index.html.en
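To check the setup without changing browser settings, a request with an explicit Accept-Language header can be sent from a script. The following Python sketch is illustrative only: the function names are hypothetical, the q-values are an arbitrary way of ranking the languages, and whether the server reports a Content-Language header back depends on its configuration.

```python
import urllib.request


def build_request(url, languages):
    """Build a GET request that states the given language preferences."""
    # The first language gets full weight; later ones get lower q-values.
    parts = []
    for rank, code in enumerate(languages):
        quality = max(1.0 - 0.1 * rank, 0.1)
        parts.append(code if rank == 0 else f"{code};q={quality:.1f}")
    request = urllib.request.Request(url)
    request.add_header("Accept-Language", ", ".join(parts))
    return request


def fetch_content_language(url, languages):
    """Send the request and return the Content-Language header, if any."""
    with urllib.request.urlopen(build_request(url, languages)) as response:
        return response.headers.get("Content-Language")
```

Calling `fetch_content_language("http://www.abyssiniagateway.net/ethiopia/", ["am", "en"])` sends "am, en;q=0.9" as the preference; comparing the result with a call using `["en"]` shows whether the server is negotiating.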

Maintenance

Server configuration and renaming of files was the easy part. With two or more versions of the web page available, my maintenance work doubles. Every time an update occurs (delete a link, add a link, change a graphic, etc.) it has to be done in at least two places. The versions could easily get out of sync.

Naturally it would (should) be easier if I could maintain only one file and auto-generate the translated versions as needed. Fortunately the ACG pages are very simple: they are little more than a list of links with headings and subheadings. The pages were originally designed before the arrival of Netscape and should still be comprehensible to 1.0 browsers and still look good in text-only browsers. Given this, my task is fairly simple.

To combine the two documents into one I took advantage of the "<span>" tag to mark up the translations. Generally all translations will still use the same link, so the HTML appears like:

<li><a href="new_page.html"><span lang="en">English Link Title</span><span>&nbsp;&nbsp;</span><span lang="am">Amharic Link Title</span></a></li>

Note that the <span>&nbsp;&nbsp;</span> sections above are there only so that the source document can be more easily read in HTML viewers. The spans and spaces will be stripped out at filter time. In a few cases the target page itself might also be language specific, so the lang=".." attribute is applied to the "<li>" instead:

<li lang="am"><a href="new_page_amharic.html">Amharic Link Title</a></li>
 
<li lang="en"><a href="new_page_english.html">English Link Title</a></li>

With translations marked up appropriately, what remains is to run an XML filter on the document and exclude the undesired translation (e.g. if generating the English document you exclude Amharic). And now the catch: when working with two or more completely different languages and scripts, the matter of sorting rears its ugly head. To avoid favoritism I try to keep the ACG links alphabetically ordered. When links are wrapped around translations in the approach shown here, the ordering can follow one language only. So after the extraction process some sorting is due for at least one language.

Or for both: once an auto-sorting feature is built into the filter, the source document can be sloppily maintained knowing that the sorting will be taken care of for you. The result is the filter "makeacg.pl", which processes the source page "index.src.html". Use as per:

% makeacg.pl index.src.html am > index.html.am
% makeacg.pl index.src.html en > index.html.en

The makeacg.pl script is tailored specifically for the meager needs of the ACG pages; expect that you'll have to refine it for more complex sites. One thing it does not handle is nested <ul> structures. Presently it ignores one level of nested unordered lists, and these I have to correct by hand after the document images are generated. Maybe I'll fix this later, maybe not; 95% of the work is done for me and I'm fairly content with that.
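The filter-then-sort idea can be sketched in Python as follows. This is not the makeacg.pl script itself (which is Perl and is not reproduced here) but a minimal illustration under several assumptions: it handles only the two markup patterns shown earlier, it sorts by plain Unicode code point rather than a proper Amharic collation, the function names and sample link titles are made up for the example, and the separator uses the numeric reference &#160; because the fragment is parsed without the XHTML DTD that defines &nbsp;.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample source; titles and file names are illustrative only.
SOURCE = """<ul>
<li><a href="news.html"><span lang="en">News</span><span>&#160;&#160;</span><span lang="am">ዜና</span></a></li>
<li><a href="art.html"><span lang="en">Art</span><span>&#160;&#160;</span><span lang="am">ጥበብ</span></a></li>
<li lang="am"><a href="only_am.html">ፊደል</a></li>
</ul>"""


def keep_language(parent, target):
    """Drop children tagged with another language, then flatten spans."""
    for child in list(parent):
        lang = child.get("lang")
        if child.tag == "span":
            # Keep only the text of the target-language span; separator
            # spans (no lang attribute) and other languages are dropped.
            if lang == target:
                parent.text = (parent.text or "") + (child.text or "")
            parent.remove(child)
        elif lang is not None and lang != target:
            parent.remove(child)
        else:
            keep_language(child, target)


def sort_lists(root):
    """Sort the <li> items of every <ul> by their visible text."""
    for ul in list(root.iter("ul")):
        items = [child for child in ul if child.tag == "li"]
        items.sort(key=lambda li: "".join(li.itertext()).strip())
        for li in items:
            ul.remove(li)
        ul.extend(items)


def make_page(source, target):
    """Return the source with only the target language kept, links sorted."""
    root = ET.fromstring(source)
    keep_language(root, target)
    sort_lists(root)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(make_page(SOURCE, "en"))
    print(make_page(SOURCE, "am"))
```

Because sorting happens after extraction, the two outputs may list the same links in different orders, which is the point: the source order no longer has to favor either language.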

XHTML vs HTML

XHTML is an XML compliant strain of HTML 4.0 (maybe) and is likely the future of HTML itself. The choice of XHTML over regular HTML was simply a step to prepare for the future and to be able to use XML tools now on HTML documents. Other than the header data that makes the document XHTML compliant, I've not added anything that wasn't in HTML 3.0 (or perhaps even earlier). In fact the HTML generated from the XHTML document does not go beyond the 3.0 level.

HTML parsers are much more forgiving than XML parsers and to use an XML parser on the XHTML source file the document had to become "well formed". The W3C validator was helpful for finding where changes had to be made. The following are cases where I had to revise the original ACG Ethiopia page to become a valid XHTML document.

  1. Add XML header and XHTML DOCTYPE.
  2. Tag names should be lowercase; at the least, mixed casing cannot be used.
  3. Closing </li> tags are required.
  4. Non-closing tags must end with "/>". For example <br> becomes <br />.
  5. The alt="..." attribute is required for images (<img ...> tags).
  6. & can not be used in a URL (common for CGI links) and must be escaped as &amp;
  7. The <nobr> tag is no longer defined.
  8. The <hr> tag can not be used within <h[1-6]> tags.

The last two problems I chose to ignore, knowing that web browsers could handle them anyway (again I'm generating HTML 3.0 level output) and that the result was going to look better than the alternatives.
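Several of the revisions listed above concern well-formedness, which any XML parser will check. The Python sketch below is only an illustration, with made-up fragments and a hypothetical helper name; it shows an XML parser rejecting the old HTML habits and accepting the revised forms. It does not check validity against the XHTML DTD, so items such as the required alt attribute or the undefined <nobr> tag would not be caught this way.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment):
    """Return True if an XML parser accepts the fragment."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


# (old HTML habit, revised XHTML form) pairs, made up for illustration.
CASES = [
    ("<ul><li>one<li>two</ul>", "<ul><li>one</li><li>two</li></ul>"),
    ("<p>line<br>break</p>", "<p>line<br />break</p>"),
    ('<a href="x.cgi?a=1&b=2">go</a>', '<a href="x.cgi?a=1&amp;b=2">go</a>'),
    ("<P>mixed case</p>", "<p>mixed case</p>"),
]

if __name__ == "__main__":
    for before, after in CASES:
        print(is_well_formed(before), is_well_formed(after), after)
```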

HTML vs XML

There could well be some tricks with XSL and CSS that would have allowed the source document to be used directly while hiding the undesired language. I don't think the sorting matter could be easily solved though (JavaScript?). I still want the ACG pages to be available to a reasonable lowest-common-denominator web browser. Accordingly this means staying with HTML 3.0 for a while longer. My experience with XML as a web page document language has been rather miserable. The world of XML rendering is a bigger mess than HTML ever was. Each browser, and version, is going to produce a different result. You can never be sure of what your web visitor is going to see, if anything. Trying to use XML derived technologies (XSL, etc.) only makes the matter worse.

Aside from client-side XML processing matters, was XHTML essential? No, not at all. In fact XML parsers are less mature, less featureful, less flexible and less friendly to use than HTML parsers, which have been around a lot longer. The "makeacg.pl" script would have been much easier to develop had I stuck with an HTML basis. On the other hand, I don't regret having been forced to make the pages "well formed".

Translation Issues

The ACG Ethiopia page has been an English-only document until now. While translating link titles into Amharic, a few issues came up. The policies I wound up following are:

  1. If a site already has an Amharic translation for the site name, use it.
  2. If an Amharic translation was just too awkward, stay in English.
  3. When different spellings are found for the same word (for example 4 spellings of "bEtekristiyan"), go with what you know to be canonically correct. If you don't know, stick with what the site is using.
  4. Ask the web masters to approve your translation and offer alternatives.
  5. Though it breaks sort order, put English links at the end of lists on the Amharic page to deemphasize English and encourage web masters to submit Amharic translations.