How to tag documents with multiple languages and scripts.

By: Peter Abrahams, Practice Leader - Accessibility and Usability, Bloor Research
Published: 5th January 2009
Copyright Bloor Research © 2009
A happy New Year to all my readers.

This holiday season was unusual in that the Christian festival Christmas, the Jewish festival Chanukkah (חנכה), and the Islamic New Year Maal Hijra, all occurred at the same time.

The previous sentence raises the question of how it should be tagged in HTML. It contains three different languages: English, Hebrew both in its native script and in transliteration, and Arabic in transliteration only. To add to the complication, the Hebrew script should be read from right to left whilst its transliteration should be read from left to right.

Before I try to answer this question I need to explain briefly why it is important to tag multilingual documents correctly. The reasons include accessibility needs such as:

  • Screen readers need to know what language they are reading so that they can pronounce it properly, or announce that the text is in a language that they do not recognize.
  • Screen magnifiers may use the direction of text to decide on how they should move around the screen.

Besides the accessibility needs, other systems may benefit from knowing the language of the text:

  • Spelling checkers need to know the language of the text so that they can check against the correct dictionary, or ignore the text if they have no dictionary to check against.
  • Tools that allow you to ask for the definition of a highlighted word need to know which language the word is in so that they can give you either a dictionary definition or a translation.
  • Search engines may also be able to use the language of the text to improve their categorisation and search results.

Having set myself this holiday question to investigate, I went straight onto the web. I quickly discovered that there are two attributes related to internationalisation (I18n):

  • 'dir', which specifies the direction of the content; its value can be 'ltr' (left to right) or 'rtl' (right to left).
  • 'xml:lang', which specifies the language of the text and can have values such as 'en' (English) or 'fr' (French).
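
To make this concrete, here is a minimal sketch of the two attributes on ordinary XHTML elements. The greetings are illustrative text of my own, not taken from the article's source. Pairing 'xml:lang' with the plain 'lang' attribute is an assumption that suits XHTML pages served as HTML, where browsers and assistive tools may only look at 'lang'.

```html
<!-- A French phrase: only the language needs stating, because the
     direction is left to right, the same as the surrounding English. -->
<p xml:lang="fr" lang="fr">Bonne année</p>

<!-- A Hebrew phrase in its native script: the language is stated and
     the direction is set to right to left. -->
<p xml:lang="he" lang="he" dir="rtl">שנה טובה</p>
```

Both attributes are inherited by child elements, so they only need to be set where the language or direction changes.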

My next discovery was that there is an international standard (ISO 639-1) that specifies two-letter codes for languages; so I found out that Arabic is 'ar' and Hebrew is 'he'. That left me with the problem of how to distinguish between Hebrew in native script and Hebrew in transliteration.

This led me into the world of the Requests for Comments (RFCs) of the Internet Engineering Task Force (IETF). Being a world of standards, it is by nature very detailed, precise and pedantic. This is as it has to be, but it does make it difficult for a newcomer to comprehend it and to navigate to the relevant area. I found out that a language attribute can be made up of more than one part, and I found a list of recognized combinations; this included 'az-Latn' for Azerbaijani written in Latin script. Thus it appeared to me that using 'he-Latn' would be a reasonable answer for my Hebrew transliteration. However, the document I was looking at said that I had to register it formally. My attempt to register it failed with a message suggesting that my request was incorrectly formatted. Luckily I had found the e-mail address of someone who obviously understood the subject, and I decided to use the personal touch rather than talk to a computer again. I am delighted to say that this approach resulted in a very quick response, even though these were the days between New Year and the restart of work next week.

A few more e-mails from the RFC community explained everything to me. I had been looking at an out-of-date RFC and should have been looking at RFC 4646. This Best Current Practice (BCP) says that a language attribute can be made up of sections relating to language, script, region and variants. The agreed values of these sections can be found in the IANA Language Subtag Registry, and they can be combined in any reasonable way, which includes 'he-Latn' and 'ar-Latn'.

So I now have the answer to my question. If you look at the source of the relevant sentence you will see that it has been tagged correctly.
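
For readers who cannot view the page source, the following is a sketch of how a condensed version of that sentence might be tagged. It is an illustration built from the attributes and language tags discussed above, not a copy of the markup actually used on the page. The use of 'span' elements and the doubling of 'xml:lang' with 'lang' are assumptions.

```html
<!-- The paragraph as a whole is English, read left to right. -->
<p xml:lang="en" lang="en" dir="ltr">
  The Christian festival Christmas, the Jewish festival
  <!-- Hebrew in Latin transliteration: still left to right. -->
  <span xml:lang="he-Latn" lang="he-Latn">Chanukkah</span>
  <!-- Hebrew in its native script: right to left. -->
  (<span xml:lang="he" lang="he" dir="rtl">חנכה</span>),
  and the Islamic New Year
  <!-- Arabic in Latin transliteration: left to right. -->
  <span xml:lang="ar-Latn" lang="ar-Latn">Maal Hijra</span>,
  all occurred at the same time.
</p>
```

Only the Hebrew in native script needs a 'dir' attribute; the two transliterations inherit the left-to-right direction of the enclosing paragraph and differ from it only in language.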

It is also relevant to point out that, although this article has concentrated on HTML, the language attribute can be used in other document formats, for example tagged PDF.

I would like to thank all those who have helped me on this journey.

It has raised two new questions for me:

My journey was more complex because Google initially pointed me at the older documents on the subject. I assume that this was because there were more references to the older documents. Is there any way we can ensure that old and obsolete documents drop down the Google search list more quickly?

I also found the standards documents difficult to understand as a newcomer. Is there any way to make them easier to understand for relatively casual users like myself? I am hoping that writing this article may help other people who are trying to solve the same or a similar problem.

Wishing everyone an accessible, usable and well-tagged New Year.

Reader Comments


5th January 2009: 'Peter Abrahams' (Author) said:

Since I wrote this article, it has been pointed out to me that the W3C has some useful documents on I18n (http://www.w3.org/International/resource-index.php?topic=lang) and a tutorial, Creating (X)HTML Pages in Arabic & Hebrew (http://www.w3.org/International/tutorials/bidi-xhtml/). These give more detail than my article can.

Published by: IT Analysis Communications Ltd.