At the end of 2001 I took some time to tinker with the language negotiation feature of HTTP. If you've ever stumbled onto the "Language" option in your web browser and wondered what it was for, that is in part what this essay is about. You can set your preferred language in the browser, which communicates the preference to a web site; if a translation is available you get it automatically. This article gives an overview of the steps I went through to get the feature working for the Abyssinia Cyberspace Gateway's Ethiopia page. The basic process is very simple and should be applicable most anywhere.
For any of this to work, the first step is to have the web server that dishes up your web site configured to look for Amharic or other translations of your web pages. I've worked only with the Apache web server, which does not recognize Ethiopian languages by default. I expect this is the case for most other widely used web servers. You may test the setup at your web hosting company by following the steps in the next section.
Unless you manage the server yourself you will have to email or otherwise contact the web support staff at your hosting company to request that your language of choice be added if unavailable. An example request letter follows:
Of course the "mod_mime" module must also be in use by the server (it is by default in Apache). The lines map the letters "am", which would be set in a browser for the "Amharic" language preference, to the file extension ".am", and likewise "ti" for Tigrinya. The file extension could have been ".amharic" or anything else, but sticking with the same terms that web browsers must use (ISO 639-1 language codes) is the norm and should make the web support staff less reluctant to add strange new languages ;-) The changes do not go into effect until the web server is restarted.
The rest is up to you. You will need to rename your Amharic web page so that the file name ends with ".am". For example:
Old Name | New Name |
---|---|
file.htm | file.htm.am |
index.htm | index.htm.am |
index.html | index.html.am |
default.asp | default.asp.am |
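If there are many pages to rename, the step can be scripted. The following is only a minimal sketch in Python, not part of the original setup: the function name "add_language_suffix" and its parameters are made up for illustration, and it assumes the pages sit in a single directory.

```python
import tempfile
from pathlib import Path


def add_language_suffix(directory, lang, patterns=("*.htm", "*.html")):
    """Append '.<lang>' to every page in `directory` matching `patterns`.

    Returns the new file names. Hypothetical helper for illustration only.
    """
    renamed = []
    for pattern in patterns:
        for page in sorted(Path(directory).glob(pattern)):
            target = page.with_name(page.name + "." + lang)
            if target.exists():
                # Never overwrite an existing translation.
                continue
            page.rename(target)
            renamed.append(target.name)
    return renamed
```

Calling add_language_suffix("ethiopia", "am") would turn "index.html" into "index.html.am"; try it on a copy of the directory first.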
Likewise your English version would be renamed with ".en" and so on. Now when a person clicks on a link for "file.html", Apache will look for it. If "file.html" is really there it is used as-is and no language checking occurs. When "file.html" is not there, Apache searches the directory for a version in the web visitor's preferred language. A person may have any number of preferred languages set (a first choice, second choice, etc.); the first language in the list for which a version is found is sent back to the browser. If none match, the default will likely be the ".en" version if that is the norm at your web hosting company.
The default language a web server uses can again be reset in the web server configuration file. Changing the default server language would impact every web site in the scope of the server (this might be in the hundreds), and you wouldn't want to change it to anything other than the primary language of the majority of the web sites hosted. The Ethiopia page, for instance, defaults to English, as would every page accessed under the server. The default would not be changed to Amharic unless I was assured that Amharic was the primary language of the visitors coming to the page and that my majority Amharic speakers also had a Unicode font installed so that they could read the page (quite another battle).
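The lookup order just described can be sketched in a few lines of Python. This is a simplified illustration of the idea, not Apache's actual negotiation algorithm, which weighs more factors. The function names "parse_accept_language" and "choose_variant" and the "default" parameter are assumptions made up for the example.

```python
import os
import tempfile


def parse_accept_language(header):
    """Return the language tags of an Accept-Language header, best first."""
    prefs = []
    for position, item in enumerate(header.split(",")):
        tag, _, params = item.strip().partition(";")
        tag = tag.strip().lower()
        if not tag:
            continue
        quality = 1.0  # a tag without "q=" counts as fully preferred
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0
        if quality > 0:
            # Sort by quality (descending), then by order of appearance.
            prefs.append((-quality, position, tag))
    return [tag for _, _, tag in sorted(prefs)]


def choose_variant(directory, name, accept_language, default="en"):
    """Pick the file to serve for `name`, or None if nothing fits."""
    exact = os.path.join(directory, name)
    if os.path.isfile(exact):
        return exact  # the plain file exists: no language checking
    for tag in parse_accept_language(accept_language) + [default]:
        # Try the full tag ("en-us") and then its primary part ("en").
        for suffix in (tag, tag.split("-")[0]):
            candidate = os.path.join(directory, name + "." + suffix)
            if os.path.isfile(candidate):
                return candidate
    return None
```

With "index.html.am" and "index.html.en" in a directory, a browser sending "am, en;q=0.5" would be given the Amharic file, and one sending only "fr" would fall back to the English file.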
A visitor will now get the right page, as per their browser configuration, when accessing:
Amharic: | http://www.abyssiniagateway.net/ethiopia/index.html.am |
English: | http://www.abyssiniagateway.net/ethiopia/index.html.en |
Server configuration and renaming of files were the easy part. With two or more versions of the web page available, my maintenance work doubles. Every time an update occurs (delete a link, add a link, change a graphic, etc.) it has to be done in at least two places. It could be easy to get out of sync.
Naturally it would (should) be easier if I could maintain only one file and auto-generate the translated versions as needed. Fortunately the ACG pages are very simple: they are little more than a list of links with headings and subheadings. The pages were originally designed before the arrival of Netscape and should still be comprehensible to 1.0 browsers and still look good in text-only browsers. Given this, my task is fairly simple.
To combine the two documents into one I took advantage of the "<span>" tag to mark up the translations. Generally all translations will still use the same link, so the HTML appears like:
Note that the <span> </span> sections above are just there so that the source document can be more easily read in HTML viewers. The spans and spaces will be stripped out at filter time.
In a few cases the target page might also be language specific, so the lang=".." attribute is applied to the "<li>" instead:
With translations marked up appropriately, what remains is to run an XML filter on the document and exclude the undesired translation (e.g. if generating the English document you exclude Amharic). And now the catch: when working with two or more completely different languages and scripts, the matter of sorting rears its ugly head. To avoid favoritism I try to keep the ACG links alphabetically ordered. When links are wrapped around translations in the approach shown here, the ordering can follow one language only. So after the extraction process some sorting is due for at least one language.
Or both: once an auto-sorting feature is built into the filter, the source document can be sloppily maintained, knowing that the sorting will be taken care of for you. The result is the filter "makeacg.pl", which processes the source page "index.src.html".
Use as per:
The makeacg.pl script is tailored specifically for the meager needs of ACG pages; expect that you'll have to refine it for more complex sites. One thing it does not handle is nested <ul> structures. Presently it ignores one level of nested unordered lists, which I have to correct by hand after the document images are generated. Maybe I'll fix this later, maybe not; 95% of the work is done for me and I'm fairly content with that.
XHTML is an XML-compliant strain of HTML 4.0 (maybe) and is likely the future of HTML itself. The choice of XHTML over regular HTML was simply a step to prepare for the future and to be able to use XML tools now on HTML documents. Other than the header data that makes the document XHTML compliant, I've not added anything new that wasn't in HTML 3.0 (or perhaps even earlier). In fact the HTML generated from the XHTML document does not go beyond the 3.0 level.
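For readers who want to experiment with the filter-and-sort idea, here is a minimal sketch in Python. It is not makeacg.pl and makes no claim to match how that script works. The sample markup, the example.org links, the Amharic titles, and the function name "extract_language" are all assumptions for illustration, and the sort uses plain code point order, which real Amharic collation may need to refine.

```python
import xml.etree.ElementTree as ET

# Hypothetical source fragment in the style described earlier:
# translations wrapped in <span lang="..">, with lang=".." on the <li>
# when the whole entry is language specific.
SOURCE = """<ul>
<li><a href="http://example.org/radio"><span lang="en">Radio Addis</span><span> </span><span lang="am">ሬዲዮ አዲስ</span></a></li>
<li><a href="http://example.org/news"><span lang="en">Ethiopian News</span><span> </span><span lang="am">የኢትዮጵያ ዜና</span></a></li>
<li lang="am"><a href="http://example.org/am-only">አማርኛ ብቻ</a></li>
</ul>"""


def extract_language(xml_text, keep):
    """Return sorted (title, href) pairs for the language `keep`."""
    root = ET.fromstring(xml_text)
    items = []
    for li in root.findall("li"):
        li_lang = li.get("lang")
        if li_lang is not None and li_lang != keep:
            continue  # entry exists only for another language
        link = li.find("a")
        spans = link.findall("span")
        if spans:
            # Keep only the wanted translation; spacer spans are dropped.
            title = "".join(s.text or "" for s in spans
                            if s.get("lang") == keep)
        else:
            title = link.text or ""
        items.append((title.strip(), link.get("href")))
    # Sort after extraction so each language gets its own ordering.
    items.sort(key=lambda item: item[0].casefold())
    return items
```

Running extract_language(SOURCE, "en") and extract_language(SOURCE, "am") on this sample yields the two shared links in opposite orders, which is exactly why the sorting has to happen per language after extraction.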
HTML parsers are much more forgiving than XML parsers and to use an XML parser on the XHTML source file the document had to become "well formed". The W3C validator was helpful for finding where changes had to be made. The following are cases where I had to revise the original ACG Ethiopia page to become a valid XHTML document.
- Closing </li>s are required.
- <br> becomes <br />.
- The alt="..." attribute is required for images (<img ...> tags).
- The <nobr> tag is no longer defined.
- The <hr> tag cannot be used within <h[1-6]> tags.
The last two problems I chose to ignore, knowing that web browsers could handle them anyway (again, I'm generating HTML 3.0 level output) and that the result was going to look better than the alternatives.
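A quick way to see the difference between a forgiving HTML parser and a strict XML parser is to feed small fragments to an XML parser and watch which ones it rejects. The sketch below uses Python's standard xml.etree.ElementTree; the helper name "is_well_formed" is made up for the example. Note that it only tests well-formedness: validity rules such as the required alt attribute or the undefined <nobr> tag need a validator like the W3C one.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup):
    """Return True if `markup` parses as XML, False otherwise."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


# Fragments an HTML browser would accept without complaint:
old_list = "<ul><li>one<li>two</ul>"            # no closing </li>
old_break = "<p>line<br>break</p>"              # unclosed <br>

# Their XHTML-style equivalents:
new_list = "<ul><li>one</li><li>two</li></ul>"
new_break = "<p>line<br />break</p>"
```

Here is_well_formed returns False for the two old-style fragments and True for their XHTML-style replacements.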
There could well be some tricks with XSL and CSS that would have allowed the source document to be used directly while hiding the undesired language. I don't think the sorting matter could be easily solved that way, though (JavaScript?). I still want the ACG pages to be available to a reasonable lowest-common-denominator web browser, which means staying with HTML 3.0 for a while longer. My experience with XML as a web page document language has been rather miserable. The world of XML rendering is a bigger mess than HTML ever was. Each browser, and each version, is going to produce a different result. You can never be sure of what your web visitor is going to see, if anything. Trying to use XML-derived technologies (XSL, etc.) only makes the matter worse.
Aside from client-side XML processing matters, was XHTML essential? No, not at all. In fact XML parsers are less mature, less featureful, less flexible and less friendly to use than HTML parsers, which have been around a lot longer. The "makeacg.pl" script would have been much easier to develop had I stuck with an HTML basis. On the other hand, I don't regret having been forced to make the pages "well formed".
The ACG Ethiopia page has been an English-only document until now. While translating link titles into Amharic, a few issues came up. The policies I wound up following are: