Category Archives: Computing

DejaVu fonts

I encourage my readers in the sidebar to view this weblog with the fonts provided by the DejaVu project. However, I would like to remind readers that it’s not enough just to install these fonts; one should also upgrade them frequently. Version 2.25 of the DejaVu fonts was released earlier this week, and in the changelog one notices something directly applicable to the matters I write about: ‘…added Kurdish and Chuvash letters from Unicode 5.1 Cyrillic Extended block.’

While fonts which seek to support Unicode completely have sufficient coverage in their initial releases to satisfy the majority of the world, speakers of languages with especially exotic scripts get support only later on, and the hinting of those Unicode regions is then carried out gradually over subsequent releases. Unfortunately, it’s precisely those minorities who probably have little knowledge of computing and little contact with the Unicode community, and so one can hardly expect them to make timely upgrades. It’s a difficult problem.

More Unicode cluelessness in Mari El

The supposed lack of support for Mari Unicode Cyrillic on Windows has gotten wide media attention in Russia, and regrettably it’s a simple case of people not knowing how to work their computer. In one television report, a woman complains that her printouts of Mari materials from an Internet cafe are gibberish because they don’t have the non-standard font Mari Times New Roman, yet in the Internet cafes I’ve visited in Yoshkar-Ola it’s trivial to install your own fonts when you start your session.

There are also numerous misunderstandings of the nature of Unicode. A recent article at MariUver brings attention to an obsolete Mari letter:

Until 1929 there was a sixth extra letter, yeru (ы) with a breve. This sign was used quite often in printed materials in the beginning of the 20th century.

In order to translate old Mari newspapers, magazines and books into modern electronic formats we need that letter. But it’s not to be found in the table of Unicode signs.

Well, it’s not to be found in the tables because it’s not a precomposed character, but it’s trivial to produce the letter from the combination of u+044b cyrillic small letter yeru and u+0306 combining breve. Here’s a screenshot from my computer.

The Cyrillic letter yeru typeset with a breve above.
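For those who would like to type the letter themselves, here is a minimal Python sketch of the combination; note that even canonical composition (NFC) leaves the letter as two codepoints, since no precomposed form exists:

```python
import unicodedata

# The obsolete Mari letter, built from its two parts:
yeru_breve = "\u044b\u0306"  # cyrillic small letter yeru + combining breve

# No precomposed codepoint exists, so NFC normalization keeps both:
assert unicodedata.normalize("NFC", yeru_breve) == yeru_breve

for c in yeru_breve:
    print(f"u+{ord(c):04x}", unicodedata.name(c))
```

Any Unicode-aware font with support for combining marks will then stack the breve over the yeru.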

Surely there are some Unicode-savvy native speakers of Mari out there. Would that they translate basic introductions to Unicode into Mari to give to their compatriots.

Lobbying for Mari script in Windows

Three Mari speakers have written a long statement in the form of an ‘open letter to Bill Gates’ calling for the inclusion of certain characters in the standard Windows fonts, so that Mari speakers no longer have to rely on antiquated solutions.

Much Mari content on the web is still written in 8-bit encoding schemes that require one to install special and non-standard fonts, and without that special font the content is gibberish. While the Mari world is increasingly embracing Unicode, generally these webpages use Cyrillic-block characters for the bulk of the writing, but then place the Latin-block characters ö and ÿ for Mari’s two rounded vowels. Of course, that’s not how it’s supposed to work at all.
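The proper codepoints are u+04e7 cyrillic small letter o with diaeresis and u+04f1 cyrillic small letter u with diaeresis. As a sketch of how such mixed-script pages could be repaired (the mapping below is my own illustration, not anything from the open letter):

```python
# Map the Latin stand-ins to the Cyrillic letters Unicode intends for Mari.
LATIN_TO_CYRILLIC = str.maketrans({
    "\u00f6": "\u04e7",  # ö -> cyrillic small letter o with diaeresis
    "\u00d6": "\u04e6",  # Ö -> cyrillic capital letter o with diaeresis
    "\u00ff": "\u04f1",  # ÿ -> cyrillic small letter u with diaeresis
    "\u0178": "\u04f0",  # Ÿ -> cyrillic capital letter u with diaeresis
})

def repair(text: str) -> str:
    """Replace the Latin stand-ins with their Cyrillic counterparts."""
    return text.translate(LATIN_TO_CYRILLIC)
```

With a table like this, converting a whole page is a single pass; the harder work, as ever, is convincing authors to stop producing such text in the first place.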

However, even were the call for default Windows support successful, it wouldn’t swiftly bring the Mari community to a greater and proper use of Unicode. Often, computers in Mari El are antiquated, and I’ve encountered many Mari speakers using versions of Windows from the 1990s. Who knows how many years would have to pass before these fonts were present by default on the majority of computers in the Mari-speaking world.

Some might bring up the fact that Windows has a universal font for all world languages called Arial Unicode MS! But what if not everyone likes Arial, what if someone wants to write in his own native language using the font Times? What should those thousands of Windows users do? And generally, it rather seems that small peoples like us are left with just Arial Unicode MS. What should be done about sites that don’t want to use Arial Unicode MS? Where can you publish thousands of books, thousands of pages, using a font other than Arial Unicode MS?

Well, my response would be, regarding print publication: give your texts to a typesetter who already knows how to efficiently and properly prepare publications in whatever script, because the writers shouldn’t be thinking about typefaces themselves. Historically, the major presses of the Finno-Ugrian peoples have produced fine-looking books (pity about that Soviet-era paper though). In recent years, however, the quality of these publications has diminished, and there is a worrying trend of self-publication where apparently people think that if they have Microsoft Word, then they are qualified to typeset their books themselves.

I would hope to see the survival of professional-quality typesetting in Mari. One of my big projects this year is producing Mari hyphenation files for LaTeX, and would that Mari-language writers turn to me or others using similar appropriate tools when it comes time to prepare their manuscript for publication.

Oxford University Press’ typesetting style

A neat discovery recently was Hart’s Rules for Compositors and Readers at the University Press, Oxford (Oxford: Oxford University Press, 38th ed. 1978), the handbook for transforming inconsistent manuscript into the beauty that is the traditional Oxford appearance, which is continued today most notably in the Clarendon Press offerings. Originally written by typesetter Horace Hart in 1864, the work circulated internally at OUP, with some government officials and friends of employees occasionally getting a copy. In 1914, this arcane text was somewhat unwillingly made available to the public at last:

Recently, however, it became known that copies of the booklet were on sale in London. A correspondent wrote that he had just bought a copy at the Stores; and as it seems more than complaisant to provide gratuitously what may afterwards be sold for profit, there is no alternative but to publish this little book.

Besides the expected rules for how to use punctuation in English texts, which form to choose when there are alternatives (e.g. ‘ambience’ versus ‘ambiance’), how to space the material, and so forth, it also contains advice on various languages. There are hyphenation rules and guidelines for italics in Russian, Greek, Italian, French, Spanish, Portuguese, and even Catalan. Owing to the long survival of ‹œ› in British publications, I would never have guessed that the ligature is to be used only in Old English and French, being replaced by two letters in Latin and Greek words.

Of course, coming from a pre-computing epoch when all that mattered was the final printed result, much of its advice is horrifying to anyone who has come to appreciate the split between semantic meaning and graphic appearance in Unicode. For example, in the chapter ‘Oriental Languages in Roman Type’ we find:

In Semitic languages ‛ain and hamza/’aleph are to be represented by a Greek asper and a lenis respectively.

If one is typesetting a document in XML that is meant to have both print and database output, then using a Greek Unicode codepoint for Semitic material will obviously create difficulties in searching and in transforming data.
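To illustrate the search problem with a hypothetical example: a reader querying a database for a word transliterated with the mark Unicode intends for ʿain (u+02bf modifier letter left half ring) will never match text typeset per Hart with the Greek asper (u+1ffe greek dasia), even though the two are nearly identical in print.

```python
import unicodedata

# Two renderings of the same hypothetical transliterated word:
hart_style = "\u1ffe" + "arab"  # Greek asper, as Hart prescribes
semantic = "\u02bf" + "arab"    # the modifier letter intended for 'ain

# Visually near-identical, but a string search for one misses the other:
assert hart_style != semantic

print(unicodedata.name(hart_style[0]))
print(unicodedata.name(semantic[0]))
```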

Still, the 38th edition clearly includes many contemporary additions. A clever guideline in the chapter on musical works is that ‘ring-modulator’ must have a hyphen; evidently some work on Stockhausen had previously made this decision necessary.

The work has been updated since the version I found in University of Helsinki’s library, with the thirty-ninth edition coming in 1983, and then the traditional title seemingly superseded by the Oxford Guide to Style in 2002 and New Hart’s Rules: The Handbook of Style for Writers and Editors in 2005. None, however, resolves the issue of creating semantically perfect text.

Icelandic and TeX

I came across an old article from a 1989 issue of TUGboat called ‘Lexicography with TeX’ that describes how the Institute of Lexicography of the University of Iceland typeset its etymological dictionary with TeX. The results are very handsome, and considering that the trials and tribulations they had to go through largely disappeared in successive versions of LaTeX, one wonders why this platform isn’t used for dictionaries more often.

Meanwhile, the Germanic Lexicon Project has made Henry Sweet’s An Icelandic Primer (Oxford: Clarendon Press, 1895) freely available. Besides offering scans in TIFF format, HTML, and MS Word, someone typeset the book anew (PDF) using LaTeX. The result is a little rough around the edges, since everything was left in the default formatting, but beautifully typesetting books when digitizing them is another thing that really should be done more often.

Early multilingual typography

Computer font projects face the challenge of covering not only the English alphabet but the many other Latin variants, as well as even more exotic scripts. An example is the DejaVu fonts, recommended for viewing my personal website and weblog, by the way (but make sure you update upon a new release). I was naive to think that this was a recent development, but I’ve now found a scan (attention, 1.8 MB image file!) of a specimen sheet by famed designer William Caslon.

After Greek and Hebrew scholarship had long been restored in Western Europe, one could expect that designers would consider those scripts important. Caslon also developed typefaces for Arabic, Syriac, Coptic, and Armenian. But I’m surprised that even as early as 1728, the date of the encyclopedia which the specimen sheet was included in, designers were already developing typefaces for the Gothic alphabet. That didn’t even make it into Unicode until version 3.1, and few fonts contain it.

Unicode 5.0

Having survived the London airport madness that began late last week just as I left, and having made it to Helsinki with significant expense and delay due to cancelled flights, I’ll get back to business here.

Since it wasn’t so long ago that all celebrated the appearance of Unicode version 4.1 with its new support for the Glagolitic script, among others, the appearance of Unicode 5.0 took me by surprise. I learnt of it only because the DejaVu Fonts Project has been adding characters I had never heard of before.

Among the notable additions is u+04cf cyrillic small letter palochka, that ubiquitous Caucasian mark that has long been sorely missed from Unicode, and a large amount of cuneiform.
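If your Python is recent enough that its bundled Unicode character database covers version 5.0 (an assumption worth checking via unicodedata.unidata_version), the new additions can be looked up by name:

```python
import unicodedata

print(unicodedata.unidata_version)     # database version bundled with Python
print(unicodedata.name("\u04cf"))      # the Caucasian palochka
print(unicodedata.name("\U00012000"))  # first sign of the new cuneiform block
```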

Text and translation from Codex Suprasliensis

Many of the reading selections in Robert Auty’s Handbook of Old Church Slavonic: Part II Texts and Glossary (London: The Athlone Press, 1960) present little challenge as they are from the OCS translation of the New Testament and so are already familiar to the student. However, one selection sure to be unknown to readers is the life of St Gregory from the East Bulgarian manuscript Codex Suprasliensis. I have placed the original text and a translation of the selection on my website.

The page uses an extravaganza of web standards that are not supported by Microsoft’s Internet Explorer, such as the q (quote) and abbr (abbreviation) tags and setting fonts based on the language of each portion as communicated by the xml:lang attribute. If you can’t see the page properly and use IE, consider switching to Firefox. If you use Firefox or another decent browser and still can’t see the page properly, please send me a screenshot.

The limitations of Unicode’s Cyrillic block as it now stands became especially irksome while I was typing the OCS original. Like every OCS manuscript written with the Cyrillic alphabet, the text makes use of iotified-A, but for some inexplicable reason this is not in Unicode and so I’ve been forced to use u+044f cyrillic small letter ya. I was able to include the titlo and palatalisation sign, and I could add the Cyrillic-block breathing marks (which are, of course, meaningless in OCS), but there seems to be no specific Cyrillic-block circumflex accent.
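The marks I could use are all combining characters from the Cyrillic block; a small Python check (assuming a reasonably recent Python, whose unicodedata module bundles the character database) lists them:

```python
import unicodedata

# The Cyrillic-block combining marks available for OCS typesetting:
# titlo, palatalization, and the two breathing marks borrowed from Greek.
for cp in (0x0483, 0x0484, 0x0485, 0x0486):
    print(f"u+{cp:04x}", unicodedata.name(chr(cp)))
```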

Adobe says no to exotic Cyrillic

On the URA-List, Johanna Laakso brought attention to an announcement at an Adobe employee’s weblog suggesting that Adobe will not be supporting the Cyrillic characters used in Mari, Udmurt, and Komi-Zyrian, as well as the neighbouring Turkic languages Bashkir and Chuvash. Apparently even common Old Church Slavonic characters will not be provided. Feedback can be posted there.

The problem isn’t one of character sets; Adobe is not clinging to an outdated system of code pages instead of embracing Unicode. It’s simply a matter of Adobe not wanting to undertake the painstaking task of designing fonts that cover the entire Cyrillic range of Unicode. Well, at least LaTeX’s Computer Modern font family has long been extended to cover almost all Cyrillic-based alphabets, and it’s all free.