SSL authentication on Freenode with Emacs ERC

The Freenode IRC network allows clients to pass an SSL certificate and automatically identify their nick with NickServ upon logging in. Freenode offers instructions on creating the SSL certificate, as well as how to configure SSL authentication on several IRC clients, but it does not describe the setup for Emacs ERC.

I managed to get this working after some failed attempts at following example code found on the web. The problem is that an argument is missing from the gnutls-cli command described in the Emacs init files that turn up in a search. If one runs gnutls-cli --x509certfile ~/.ssl/mynick.cert -p 6697 irc.freenode.net from the command line, one sees in the output: Successfully sent 0 certificate(s) to server. Naturally, without a certificate sent to the server, the automatic NickServ identification will fail.

The correct way is to add the --x509keyfile argument, i.e. gnutls-cli --x509certfile ~/.ssl/mynick.cert --x509keyfile ~/.ssl/mynick.key -p 6697 irc.freenode.net. When this is done, the output will show Successfully sent 1 certificate(s) to server. NickServ identification will then run automatically, assuming that you have followed Freenode’s instructions and told NickServ your certificate’s SHA1 fingerprint.

A lot of people’s Emacs init files define tls-program as a global variable and specify the certificate there. This is bad for privacy: while one wants to disclose one’s identity to Freenode, one probably does not want to tell every other server contacted over SSL who one is. The best approach is therefore to create a function that calls ERC, and to use Emacs’ let form to bind a value of tls-program that is only in effect for ERC:

(defun start-irc ()
  "Connect to IRC over SSL and pass a certificate for nick identification."
  (interactive)
  (let ((tls-program
         '("gnutls-cli --x509certfile ~/.ssl/mynick.cert --x509keyfile ~/.ssl/mynick.key -p %p %h")))
    (erc-tls :server "irc.freenode.net" :port 6697
             :nick "mynick" :full-name "mynick")))

For Freenode, as with all SSL connections made through Emacs, users may also want to consider the certificate pinning that GnuTLS provides; see Jens Lechtenbörger’s Certificate Pinning for GNU Emacs.

More Chuvash and Mari at OpenStreetMap

I am drawing up a table of placename abbreviations from Ashmarin’s Chuvash dictionary along with their geographical coordinates, e.g. Урас-к. = д. Ураз-касы, Янтиковского района ЧАССР = 55.571, 47.7352. This will allow me to more easily map the distribution of some isoglosses that have interested me. For the most part, it has been very easy to link Ashmarin’s villages with contemporary ones, though there are a small number of villages which either no longer exist, or which were drastically renamed after the October Revolution.

In the course of doing this research, I’ve added the Chuvash names for several hundred villages in Chuvashia and in the Chuvash diaspora to OpenStreetMap (a project I am passionate about, as I described here). One of the strange things I’ve discovered is that Tatars and Bashkirs are more likely to recognize Chuvash than editors from Chuvashia are. Very, very few villages in Chuvashia were marked with a Chuvash name on OSM when I began this project, but villages in Tatarstan and Bashkiria that historically had a Chuvash population were often marked with the Chuvash name alongside the Russian, Tatar or Bashkir name.

In two instances for Chuvash villages within Chuvashia, someone had specified the Chuvash name not with the name:cv tag but with the old_name tag, which just breaks my heart.

Many of the Chuvash placenames floating around the internet were drawn from the Chuvash Encyclopedia, an authoritative reference source. However, the Chuvash Encyclopedia was digitized at an early date, when Chuvash fonts were not thought to be widely available. Thus, for the Chuvash letters ҫ, ӗ, ӑ and ӳ, the Chuvash Encyclopedia actually uses similar-looking codepoints from Unicode’s Latin blocks, not the Cyrillic block. Because the names were copied and pasted elsewhere, this error persists in the Tatar-language Wikipedia and in some OpenStreetMap points. I suppose I’ll have to write a script to automate correcting these on OSM.
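Such a correction script could be little more than a translation table. Here is a minimal sketch in Python; the Cyrillic targets are the four extra Chuvash letters, while the exact Latin lookalikes any given source used are an assumption on my part:

```python
# Map Latin-block lookalikes to the proper Cyrillic codepoints for the
# four extra Chuvash letters. Which lookalikes a particular source
# actually used is an assumption; extend the table as needed.
LATIN_TO_CYRILLIC = str.maketrans({
    "ç": "ҫ", "Ç": "Ҫ",  # U+00E7/U+00C7 -> U+04AB/U+04AA
    "ĕ": "ӗ", "Ĕ": "Ӗ",  # U+0115/U+0114 -> U+04D7/U+04D6
    "ă": "ӑ", "Ă": "Ӑ",  # U+0103/U+0102 -> U+04D1/U+04D0
    "ÿ": "ӳ", "Ÿ": "Ӳ",  # U+00FF/U+0178 -> U+04F3/U+04F2
})

def fix_chuvash(text: str) -> str:
    """Return text with Latin lookalikes replaced by Cyrillic letters."""
    return text.translate(LATIN_TO_CYRILLIC)
```

Pure Cyrillic text passes through unchanged, so the function can safely be run over every name tag.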

For the moment I am not so enthusiastic about adding Mari placenames, because existing Meadow Mari/Eastern Mari placenames are marked up variously with name:chm and name:mhr. I had never thought of the existence of three ISO 639-3 codes for Mari (chm for Mari in general and mhr for Meadow Mari/Eastern Mari, plus mrj for Hill Mari) as a problem before, but because OSM generates map tiles from one and only one ISO 639 code, some Mari-language names will be invisible whichever code one chooses. I suppose this too will have to be automated with a script, however redundant it might seem to add both name:chm and name:mhr to every single point.
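Such a script might do nothing more than mirror the name between the two tags. A minimal sketch, where the plain dictionary stands in for whatever tag structure an OSM editing library would hand back (the function name is my own):

```python
# Mirror a Meadow Mari name between name:mhr and name:chm so that the
# name renders regardless of which ISO 639 code the tile renderer uses.
# The tag keys are real OSM conventions; the helper is hypothetical.
def mirror_mari_name(tags: dict) -> dict:
    name = tags.get("name:mhr") or tags.get("name:chm")
    if name is not None:
        # setdefault leaves any existing value untouched, so points that
        # already carry both tags are not modified.
        tags.setdefault("name:mhr", name)
        tags.setdefault("name:chm", name)
    return tags
```

Points with neither tag come back unchanged, so the function is safe to apply blindly across a whole extract.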

Mari and Udmurt dictionaries for Kindle (with caveats)

A screenshot of the Udmurt dictionary on the Kindle.

There are fairly ample Mari-Russian and Udmurt-Russian dictionaries in the GoldenDict format (get them here). And there is a toolchain that can convert a GoldenDict dictionary to Mobipocket format for use on the Kindle and other e-readers. It works like this:

  1. Decompress the GoldenDict file (.dsl.gz) with gzip.
  2. Pass the resulting .dsl file to dsl2mobi. This Ruby script will produce an HTML file (the dictionary data) and an .opf file (the metadata).
  3. Open the .opf file in a text editor, look down through the XML, and specify the title of the dictionary, as well as the input and output languages of the dictionary through their ISO 639 codes (udm or chm and ru in this case).
  4. Run the Windows application mobigen.exe (available from Mobipocket) on the .opf file. The -c2 option enables compression for a much smaller file. This produces a .mobi file that can then be moved to the documents/dictionaries/ directory on the Kindle.
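The whole pipeline might look like this in a shell session. The filenames are placeholders, the exact dsl2mobi invocation may differ from what is shown, and running mobigen.exe under Wine is my assumption for doing step 4 outside Windows:

```shell
# 1. Decompress the GoldenDict file
gzip -d udmurt-russian.dsl.gz

# 2. Convert to an HTML dictionary body plus .opf metadata
ruby dsl2mobi.rb udmurt-russian.dsl

# 3. Edit the .opf by hand: set the title and the ISO 639 codes
#    for the input and output languages (udm and ru here)
$EDITOR udmurt-russian.opf

# 4. Build the compressed .mobi and copy it to the Kindle
wine mobigen.exe -c2 udmurt-russian.opf
cp udmurt-russian.mobi /media/Kindle/documents/dictionaries/
```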

This worked for the Mari and Udmurt dictionaries. I can highlight certain words on my Kindle and an entry from the Mari or Udmurt dictionary will pop up. The only problem is that these particular two dictionaries don’t have much in the way of morphology, that is, they don’t have all the inflected forms of words.

The Udmurt dictionary is the more usable of the two: for verbs, the past participle is often included separately, with a link to the main entry for the verb under its infinitive (e.g. highlighting потэм pops up a window with a link to потыны), but highlighting any other inflected form, such as потӥсько, results in an error that the word is not found. Because the Kindle allows one to highlight only whole words, not a subset of letters within them, one cannot even leave off, say, the definite suffix -ез of a word like ужез to get the entry for уж. Perhaps that is possible on other e-readers.

The Mari dictionary provides automatic lookup of words only if they are the infinitive for verbs or the nominative singular for nouns.

Besides trying to look up words by highlighting them in a text, there is also the option of simply opening the dictionary from the Kindle home screen and searching for the word one needs with the Kindle keyboard. The caveat here is that one can only search in a Latin transliteration based on Russian Cyrillic, and there does not appear to be a way to input the extended Cyrillic characters used by Mari and Udmurt.

One might ask what the point of creating these dictionaries is when there are not yet any e-books for these languages. There may be no editions of classic literature yet in e-book form (though I’m working on an e-book version of Chavain’s Elnet), but there are a number of Udmurt and Mari blogs from ethnofuturistically-minded people, and with Calibre one can convert the RSS feeds of these blogs to a format suitable for reading on the Kindle. It’s a good way to have language practice on the go, even if the minutiae of the writers’ lives are not always particularly interesting in themselves.

Language-related uses of a device running XBMC

A photograph of the Raspberry Pi.

Last year I bought a Raspberry Pi to use as a media centre, though only lately have I been able to spend enough time at home to really explore its uses. I installed the Raspbmc operating system, which features everything one would expect from a Linux distribution but offers the XBMC media center as its main interface.

XBMC has a plugin architecture. Here are two XBMC plugins that I’ve come across so far to conveniently watch content in languages I keep up with:

Youtube Channels

A screenshot of Kazakh television programming in XBMC through the Youtube Channels plugin.

This XBMC plugin allows one to save certain YouTube playlists for easy access. I’ve used it for the Chuvash, Kazakh and Mari television programmes that the respective state broadcasters upload to YouTube. When prompted to search for a channel, use the Latin-alphabet username of the YouTube account.

This plugin is part of the central XBMC plugin repository and can be easily enabled from XBMC settings.

Macedonia on Demand

In spite of its name, this XBMC plugin offers more than just Macedonian-language channels. I got it for the Croatian and Serbian programming made available through this plugin, and I was pleased to find that one of the Macedonian channels is for the country’s Albanian-speaking population.

This plugin is not yet part of the central XBMC plugin repository but must be downloaded from the developer’s website.

Language-based CSS hyphenation across browsers

In 2011, several major web browsers changed their handling of CSS to allow hyphenation in justified text. This was initially implemented through browser-specific extensions, so one must include the following list of CSS properties in one’s stylesheet to cover most platforms:

-webkit-hyphens: auto;
-moz-hyphens: auto;
-ms-hyphens: auto;
-o-hyphens: auto;
hyphens: auto;

The Open Web Group has an informative article on web hyphenation and how it will change with CSS version 4.

Now, hyphenation makes no sense if it is not based on rules specifically formulated for the language of the text in question, which the browser knows to choose based on the value of an HTML element’s lang attribute.
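For example, a snippet like the following (the Finnish sentence is just sample text of my own) will only hyphenate when the lang attribute is present, because without it the browser has no idea which hyphenation patterns to apply:

```html
<style>
  p {
    text-align: justify;
    -webkit-hyphens: auto;
    -moz-hyphens: auto;
    -ms-hyphens: auto;
    -o-hyphens: auto;
    hyphens: auto;
  }
</style>
<!-- Hyphenated with Finnish patterns thanks to lang="fi" -->
<p lang="fi">Suomen pitkät yhdyssanat tavutetaan oikein vain,
kun selain tietää tekstin kielen.</p>
```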

DejaVu fonts

I encourage my readers in the sidebar to view this weblog with the fonts provided by the DejaVu project. However, I would like to remind readers that it’s not enough to just install these fonts, as one also should upgrade them frequently. Version 2.25 of the DejaVu fonts was released earlier this week, and in the Changelog one notices something directly applicable to the matters I write about: …added Kurdish and Chuvash letters from Unicode 5.1 Cyrillic Extended block.

While fonts that seek to support Unicode completely have sufficient coverage in their initial releases to satisfy the majority of the world, speakers of languages with especially exotic scripts gain support only later, and the hinting of those Unicode regions is then carried out gradually over subsequent releases. Unfortunately, it is precisely those minorities who probably have little knowledge of computing and little contact with the Unicode community, so one can hardly expect them to make timely upgrades. It’s a difficult conundrum.

More Unicode cluelessness in Mari El

The supposed lack of support for Mari Unicode Cyrillic on Windows has gotten wide media attention in Russia, and regrettably it’s a simple case of people not knowing how to work their computers. In one television report, a woman complains that her printouts of Mari materials from an Internet cafe are gibberish because the cafe’s computers don’t have the non-standard font Mari Times New Roman; yet in the Internet cafes I’ve visited in Yoshkar-Ola, it is trivial to install one’s own fonts at the start of a session.

There are also numerous misunderstandings of the nature of Unicode. A recent article at MariUver brings attention to an obsolete Mari letter:

Until 1929 there was a sixth extra letter, yeru (ы) with a breve. This sign was used quite often in printed materials in the beginning of the 20th century.

In order to translate old Mari newspapers, magazines and books into modern electronic formats we need that letter. But it’s not to be found in the table of Unicode signs.

Well, it’s not to be found in the tables because it’s not a precomposed character; it is trivial to produce the letter from the combination of U+044B CYRILLIC SMALL LETTER YERU and U+0306 COMBINING BREVE. Here’s a screenshot from my computer.

The Cyrillic letter yeru typeset with a breve above.
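The composition is easy to verify in Python. The snippet below simply confirms the two codepoints and that, since no precomposed form exists, Unicode normalization leaves the sequence as it is:

```python
import unicodedata

# The obsolete Mari letter: a plain yeru followed by a combining breve.
letter = "\u044b\u0306"  # ы̆

assert unicodedata.name(letter[0]) == "CYRILLIC SMALL LETTER YERU"
assert unicodedata.name(letter[1]) == "COMBINING BREVE"

# No precomposed character exists, so NFC normalization leaves the
# two-codepoint sequence untouched.
assert unicodedata.normalize("NFC", letter) == letter
```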

Surely there are some Unicode-savvy native speakers of Mari out there. Would that they translate basic introductions to Unicode into Mari to give to their compatriots.

Lobbying for Mari script in Windows

Three Mari speakers have written a long statement in the form of an ‘open letter to Bill Gates’ calling for the inclusion of certain characters in the standard Windows fonts so that Mari speakers no longer have to rely on antiquated solutions.

Much Mari content on the web is still written in 8-bit encodings that require one to install special, non-standard fonts; without the special font, the content is gibberish. While the Mari world is increasingly embracing Unicode, these webpages generally use characters from Unicode’s Cyrillic block for the bulk of the writing but then substitute the Latin-block characters ö and ÿ for Mari’s two rounded vowels. Of course, that’s not how it’s supposed to work at all.
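The mixture is easy to detect mechanically: any single word containing both Cyrillic and Latin letters is almost certainly using this workaround. A small Python sketch (the function name is my own):

```python
import re
import unicodedata

def mixed_script_words(text):
    """Return words mixing Cyrillic and Latin letters: the telltale
    sign of Latin ö/ÿ pasted into otherwise-Cyrillic Mari text."""
    flagged = []
    for word in re.findall(r"\w+", text):
        # The first token of a codepoint's Unicode name is its script,
        # e.g. "CYRILLIC SMALL LETTER EM" or "LATIN SMALL LETTER O ...".
        scripts = {unicodedata.name(c).split()[0] for c in word if c.isalpha()}
        if {"CYRILLIC", "LATIN"} <= scripts:
            flagged.append(word)
    return flagged
```

Run over a scraped page, this would list exactly the words needing repair with the proper Cyrillic codepoints.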

However, even were the call for default Windows support successful, it wouldn’t swiftly bring the Mari community to a greater and proper use of Unicode. Often, computers in Mari El are antiquated, and I’ve encountered many Mari speakers using versions of Windows from the 1990s. Who knows how many years would have to pass before these fonts were present by default on the majority of computers in the Mari-speaking world.

Some might bring up the fact that Windows has a universal font for all world languages called Arial Unicode MS! But what if not everyone likes Arial? What if someone wants to write in his own native language using the font Times? What should those thousands of Windows users do? Generally, it rather seems that small peoples like us are left with just Arial Unicode MS. What should be done about sites that don’t want to use Arial Unicode MS? Where can one publish thousands of books, thousands of pages, using a font other than Arial Unicode MS?

Well, my response would be, regarding print publication: give your texts to a typesetter who already knows how to efficiently and properly prepare publications in whatever script, because the writers shouldn’t be thinking about typefaces themselves. Historically, the major presses of the Finno-Ugrian peoples have produced fine-looking books (pity about that Soviet-era paper though). In recent years, however, the quality of these publications has diminished, and there is a worrying trend of self-publication where apparently people think that if they have Microsoft Word, then they are qualified to typeset their books themselves.

I would hope to see the survival of professional-quality typesetting in Mari. One of my big projects this year is producing Mari hyphenation files for LaTeX, and would that Mari-language writers turn to me or to others using similarly appropriate tools when it comes time to prepare their manuscripts for publication.

Oxford University Press’ typesetting style

A neat discovery recently was Hart’s Rules for Compositors and Readers at the University Press, Oxford (Oxford: Oxford University Press, 38th ed., 1978), the handbook for transforming inconsistent manuscript into the beauty that is the traditional Oxford appearance, continued today most notably in the Clarendon Press offerings. Originally compiled in 1893 by Horace Hart, Printer to the University, the work circulated internally at OUP, with some government officials and friends of employees occasionally getting a copy. In 1914, this arcane text was somewhat unwillingly made available to the public at last:

Recently, however, it became known that copies of the booklet were on sale in London. A correspondent wrote that he had just bought a copy at the Stores; and as it seems more than complaisant to provide gratuitously what may afterwards be sold for profit, there is no alternative but to publish this little book.

Besides the expected rules for how to punctuate English texts, which spelling to choose when there are alternative forms (e.g. ‘ambience’ versus ‘ambiance’), how to space the material, and so forth, it also contains advice on various languages: there are hyphenation rules and guidelines for italics in Russian, Greek, Italian, French, Spanish, Portuguese, and even Catalan. Owing to the long survival of ‹œ› in British publications, I would never have guessed that the ligature is to be used only in Old English and French, and replaced by two letters in Latin and Greek words.

Of course, coming from a pre-computing epoch in which all that mattered was the final printed result, much of its advice is horrifying to one who has come to appreciate the split between semantic meaning and graphic appearance in Unicode. For example, in the chapter ‘Oriental Languages in Roman Type’ we find:

In Semitic languages ‛ain and hamza/’aleph are to be represented by a Greek asper and a lenis respectively.

If one is typesetting a document in XML that is meant to have both print and database output, then using a Greek Unicode codepoint for Semitic material will obviously create difficulties in searching and in transforming the data.

Still, the 38th edition clearly includes many contemporary additions. A clever guideline in the chapter on musical works is that ‘ring-modulator’ must take a hyphen; evidently some work on Stockhausen had previously made the decision necessary.

The work has been updated since the version I found in University of Helsinki’s library, with the thirty-ninth edition coming in 1983, and then the traditional title seemingly superseded by the Oxford Guide to Style in 2002 and New Hart’s Rules: The Handbook of Style for Writers and Editors in 2005. None, however, resolves the issue of creating semantically perfect text.