ASCII Everywhere, transliteration from UTF-8

hroptatyr/aeiou

ASCII Everywhere

No-frills, zero-dependency command-line tools to convert (transliterate) between UTF-8 and ASCII.

Red tape

  • no dependencies other than a POSIX system and a C99 compiler
  • licensed under BSD3c

Motivation

We believe that ASCII is still the common denominator in automated, data-oriented and manual, data-heavy workflows. In those fields it is common to deal with many local peculiarities at once.

We do not believe that today's systems (i18n, l10n, locales, LANG variables and whatnot) are suited to the most basic tasks, for example rewriting a set of local date formats as ISO 8601, or expressing times from different local time zones as Zulu time.

For character encodings, the new Zulu time is UTF-8, but historically it has been 7-bit ASCII, as is evident in the DNS and in protocols like SMTP.

So-called user-friendly rapid-development languages like Python don't work at all when the environment isn't in sync with the expectations of the author. Heavy-duty tools like sort or grep behave differently in different environments. ASCII, however, seems to be the remedy to all problems.

translit

Tool to transliterate between UTF-8 encoded files and ASCII. Based on Sean Burke's Text::Unidecode. Unlike the Perl version, translit can maintain case:

$ translit <<EOF
ЧАЩА
EOF
CHASHCHA
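
The exact rule translit uses to maintain case is not described here. A minimal Python sketch of one plausible rule, assuming a hypothetical three-entry table: a multi-letter replacement is fully upper-cased when a neighbouring character is also upper case, and only capitalised otherwise.

```python
# Hypothetical mini table (not translit's real data):
# lower-case Cyrillic letter -> ASCII replacement.
TABLE = {"ч": "ch", "а": "a", "щ": "shch"}


def translit_keep_case(text: str) -> str:
    """Transliterate via TABLE, carrying the source letter's case over."""
    out = []
    for i, ch in enumerate(text):
        repl = TABLE.get(ch.lower())
        if repl is None:
            out.append(ch)  # unmapped characters pass through unchanged
            continue
        if ch.isupper():
            prev_up = i > 0 and text[i - 1].isupper()
            next_up = i + 1 < len(text) and text[i + 1].isupper()
            # Inside an upper-case run: "SHCH"; otherwise: "Shch".
            repl = repl.upper() if (prev_up or next_up) else repl.capitalize()
        out.append(repl)
    return "".join(out)


print(translit_keep_case("ЧАЩА"))
print(translit_keep_case("Чаща"))
```

With this rule an all-caps word stays all caps, while a capitalised word keeps only its initial capital.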

and condense spaces:

$ translit <<EOF
ノーベル賞の
EOF
no--beru Shang no

Another source of controversy in Sean Burke's version (and many other transliterators) is that certain characters transliterate differently in different languages. The classic example is ü, which Germans would transcribe as ue, whereas Spanish speakers would transliterate pingüino as pinguino.

Language-specific packs of transliterations can be generated with the translcc tool. The definitions themselves are plain C99 arrays with designated initializers. Language packs can be loaded with -l|--lang:

$ translit -l tr_639_1_de <<EOF
Überschuß
Äpfel
KÖRBE
EOF
Ueberschuss
Aepfel
KOERBE
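
The idea of a language pack can be pictured as a small table laid over the default one. The following Python sketch is only an illustration: the table contents, the pack name "de" and the function name are assumptions, and the real packs are C arrays generated by translcc.

```python
# Hypothetical default table and language pack (illustrative data only).
DEFAULT = {"ä": "a", "ö": "o", "ü": "u", "ß": "ss"}
PACKS = {"de": {"ä": "ae", "ö": "oe", "ü": "ue"}}


def translit_lang(text: str, lang: str = "") -> str:
    """Transliterate with DEFAULT, overridden by the pack for `lang`."""
    table = {**DEFAULT, **PACKS.get(lang, {})}  # pack entries win
    out = []
    for i, ch in enumerate(text):
        repl = table.get(ch.lower())
        if repl is None:
            out.append(ch)  # ASCII and unmapped characters pass through
            continue
        if ch.isupper():
            neighbours = text[max(i - 1, 0):i] + text[i + 1:i + 2]
            in_caps_run = any(c.isupper() for c in neighbours)
            repl = repl.upper() if in_caps_run else repl.capitalize()
        out.append(repl)
    return "".join(out)


print(translit_lang("pingüino"))         # default table
print(translit_lang("Überschuß", "de"))  # German overrides
```

Keeping the pack as an overlay means a language only has to list the characters it treats differently from the default.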

Furthermore, one character of context can be used to produce compound transliterations:

$ translit -l tr_639_1_ru <<EOF
Такси
EOF
Taxi

$ translit -l tr_639_1_ja <<EOF
シュヴァンク
EOF
syuvuanku
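
How translit consults the context character internally is not shown here. One way to picture it, as a Python sketch with hypothetical tables, is to try a two-character lookup first and fall back to the single-character table.

```python
# Hypothetical tables (illustrative data only).
SINGLE = {"т": "t", "а": "a", "к": "k", "с": "s", "и": "i"}
PAIRS = {"кс": "x"}  # compound rule: к followed by с -> x


def translit_context(text: str, pairs: dict = PAIRS) -> str:
    """Transliterate, letting the next character select a compound rule."""
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if len(pair) == 2 and pair.lower() in pairs:
            repl, step = pairs[pair.lower()], 2  # consume both characters
        else:
            repl, step = SINGLE.get(text[i].lower(), text[i]), 1
        if text[i].isupper():
            repl = repl.capitalize()
        out.append(repl)
        i += step
    return "".join(out)


print(translit_context("Такси"))            # with the compound rule
print(translit_context("Такси", pairs={}))  # without it
```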

aeiou

Tool to replace Unicode code point escape sequences (\uxxxx or \Uxxxxxxxx) with the corresponding Unicode character encoded as UTF-8, or vice versa.

Example:

$ /bin/echo '\u307e\u30c4\u3057\u305f' | aeiou | translit
matusita

$ /bin/echo '\u307e\u30c4\u3057\u305f' | aeiou | aeiou -d
\u307e\u30c4\u3057\u305f

Note: The shell's echo routine (e.g. zsh's) might already interpret the unicode sequence.
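
The two directions can be sketched in Python as follows. This is an illustration under assumptions, not aeiou's implementation: the function names are made up, ASCII is assumed to pass through unchanged, and the lower-case hex digits are taken from the example output above.

```python
import re

# Matches \uXXXX (4 hex digits) or \UXXXXXXXX (8 hex digits).
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def decode_escapes(text: str) -> str:
    """Replace \\uXXXX / \\UXXXXXXXX escapes with the actual characters."""
    def repl(m):
        cp = int(m.group(1) or m.group(2), 16)
        # Leave surrogates and out-of-range values untouched: they have
        # no valid UTF-8 encoding.
        if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            return m.group(0)
        return chr(cp)
    return _ESCAPE.sub(repl, text)


def encode_escapes(text: str) -> str:
    """Replace every non-ASCII character with a \\u or \\U escape."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(ch)
        elif cp <= 0xFFFF:
            out.append("\\u%04x" % cp)
        else:
            out.append("\\U%08x" % cp)
    return "".join(out)


escaped = r"\u307e\u30c4\u3057\u305f"
print(encode_escapes(decode_escapes(escaped)) == escaped)
```

The round trip mirrors the aeiou | aeiou -d pipeline shown above.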