For a long time computers had a simple idea of a character: each octet (8-bit byte) of text contained one character. This meant an application could only use 256 characters at once. The first 128 characters (0 to 127) on Unix and similar systems usually corresponded to the ASCII character set, as they still do. So all other possibilities had to be crammed into the remaining 128. This was done by picking the appropriate character set for the use you were making. For example, ISO 8859 specified a set of extensions to ASCII for various alphabets.
This was fine for simple extensions and for close relatives of the Latin alphabet (those with no more than a few dozen extra alphabetic characters), but useless for more complex writing systems. Also, having a different character set for each language is inconvenient: you have to start a new terminal to run a shell with each character set. So the character set had to be extended. To cut a long story short, the world has mostly standardised on a character set called Unicode, related to the international standard ISO 10646. The intention is that this will contain every single character used in all the languages of the world.
This has far too many characters to fit into a single octet. What's more, UNIX utilities such as zsh are so used to dealing with ASCII that removing it would cause no end of trouble. So what happens is this: the 128 ASCII characters are kept exactly the same (and they're the same as the first 128 characters of Unicode), but the remaining 128 octet values are used to build up any other Unicode character as a sequence of several octets. The shell doesn't need to interpret these directly; it just needs to ask the system library how many octets form the next character, and whether there's a valid character there at all. (It can also ask the system what width the character takes up on the screen, so that characters no longer need to be exactly one position wide.)
The way this is done is called UTF-8. Multibyte encodings of other character sets exist (you might encounter them for Asian character sets); zsh will be able to use any such encoding as long as it contains ASCII as a single-octet subset and the system can provide information about other characters. However, in the case of Unicode, UTF-8 is the only one you are likely to encounter that is useful in zsh.
(In case you're confused: Unicode is the character set, while UTF-8 is an encoding of it. You might hear about other encodings, such as UCS-2 and UCS-4 which are basically the character's index in the character set as a two-octet or four-octet integer. You might see files encoded this way, for example on Windows, but the shell can't deal directly with text in those formats.)
Until version 4.3, zsh didn't handle multibyte input properly at all. Each octet in a multibyte character would look to the shell like a separate character. If your terminal handled the character set, characters might appear correct on screen, but trying to edit them would cause all sorts of odd effects. (It was possible to edit in zsh using single-byte extensions of ASCII such as the ISO 8859 family, however.)
From version 4.3.4 (stable versions starting from 5.0), multibyte input is handled in the line editor if zsh has been compiled with the appropriate definitions, and is automatically activated. This is indicated by the option MULTIBYTE, which is set by default on shells that support multibyte mode. Hence you can test this with a standard option test: `[[ -o multibyte ]]'.
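For instance, a startup file might guard locale-dependent settings with that test; a minimal sketch:

  # only set up UTF-8 related behaviour if this shell supports it
  if [[ -o multibyte ]]; then
    print "multibyte (e.g. UTF-8) input is supported"
  fi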
The MULTIBYTE option affects the entire shell: parameter expansion, pattern matching and so on count a valid multibyte character as a single character, however many octets it occupies. You can unset the option locally in a function to revert to single-byte operation.
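As an illustration (a sketch, assuming a UTF-8 locale): the character é is one character but two octets in UTF-8, so the length reported by ${#} changes when the option is unset:

  utf8-length-demo() {
    local s=$'\u00e9'                   # "é": one character, two octets in UTF-8
    print "with multibyte:    ${#s}"    # 1: counted as a single character
    setopt localoptions nomultibyte     # revert to single-byte operation locally
    print "without multibyte: ${#s}"    # 2: counted octet by octet
  }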
As multibyte characters are nowadays handled as standard by most utilities, since version 5.1 the MULTIBYTE option has also been turned on when emulating other shells.
The other option that affects multibyte support is COMBINING_CHARS, new in version 4.3.9. When this is set, any zero-length punctuation characters that follow an alphanumeric character (the base character) are assumed to be modifications (accents etc.) to the base character and to be displayed within the same screen area as the base character. As not all terminals handle this, even if they correctly display the base multibyte character, this option is not on by default. Recent versions of the KDE and GNOME terminal emulators konsole and gnome-terminal, as well as rxvt-unicode and the Unicode version of xterm (xterm -u8, or the front-end uxterm), are known to handle combining characters.
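A quick way to see whether your terminal copes (a sketch, assuming a UTF-8 locale): print a base character followed by U+0301 COMBINING ACUTE ACCENT and check whether the two code points share a single screen position:

  # "e" followed by U+0301; a capable terminal shows one accented character
  print $'e\u0301'
  # tell the line editor to assume the terminal handles such sequences
  setopt combiningchars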
The COMBINING_CHARS option only affects output; combining characters may always be input, but when the option is off they will be displayed specially. By default this is as a code point (the index of the character in the character set) between angle brackets, usually in inverse video. Highlighting of such special characters can be modified using the new array parameter zle_highlight.
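For example, the following (a sketch; the special context and the attribute syntax are described in the zshzle(1) manual page) shows such characters in bold red instead of inverse video:

  # display special characters, e.g. lone combining code points, in bold red
  zle_highlight=(special:fg=red,bold)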
Once you have a version of zsh with multibyte support, you need to ensure the environment is correct. We'll assume you're using UTF-8. Many modern systems may come set up correctly already. Try one of the editing widgets described in the next section to see.
There are basically three components.
First, the locale, selected by the environment variable LANG (there are others, but this is the one to start with). If you have a recent operating system, very likely it is already set appropriately. Otherwise, you need to find a locale whose name contains UTF-8. This will be a variant on your usual locale, which typically indicates the language and country; for example, mine is en_GB.UTF-8. Luckily, zsh can complete locale names, so if you have the new completion system loaded you can type `export LANG=' and attempt to complete a suitable locale; see also the sketch after this list. It's the locale that tells the shell to expect the right form of multibyte input. (However, there's no guarantee that the shell is actually going to get this input: for example, if you edit file names that have been created using a different character set it won't work properly.)
Second, the terminal emulator. Recent versions of the common graphical emulators, including konsole and gnome-terminal, are likely to have extensive support for localization and may work correctly as soon as they know the locale. You can enable UTF-8 support for xterm in its application defaults file. The following are the relevant resources; you don't actually need all of them, as described below. If you use a ~/.Xdefaults or ~/.Xresources file for setting resources, prefix all the lines with xterm:
  *wideChars: true
  *locale: true
  *utf8: 1
  *vt100Graphics: true

This turns on support for wide characters (this is enabled by the utf8 resource, too); enables conversions to UTF-8 from other locales (this is the key resource and actually overrides utf8); turns on UTF-8 mode (this resource is mostly used to force use of UTF-8 characters if your locale system isn't up to it); and allows certain graphic characters to work even with UTF-8 enabled. (Thanks to Phil Pennock for suggestions.)
Third, the font. If you select the font for your terminal yourself, make sure its encoding is iso10646-1 (and not, for example, iso8859-1). Not all characters will be available in any font, and some fonts may have a more restricted range of Unicode characters than others.
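As a sketch of the first component (the locale name below is only an example; pick one that your system actually reports as available):

  # list the UTF-8 locales installed on the system
  locale -a | grep -i utf
  # select a UTF-8 variant of your usual locale for this session
  export LANG=en_GB.UTF-8
  # confirm the settings the shell will actually use
  locale

For the third component, an X resource along the following lines (the font name is just an example of a bitmap font with an iso10646-1 encoding) selects a Unicode-encoded font for xterm:

  xterm*font: -misc-fixed-medium-r-normal--18-*-*-*-*-*-iso10646-1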
As mentioned in the previous section, bindkey -m now outputs a warning message telling you that multibyte input from the terminal is likely not to work. (See 3.5 if you don't know what this feature does.) If you don't need to input any characters that require a multibyte encoding, however, you can still use the meta bindings and can ignore the warning message. Use `bindkey -m 2>/dev/null' to suppress it.
You might also note that the latest version of the Cygwin environment for Windows supports UTF-8. In previous versions, zsh was able to compile with the MULTIBYTE option enabled, but the system didn't provide full support for it.
Two functions are provided with zsh that help you input characters.
As with all editing widgets implemented by functions, you need to
mark the function for autoload, create the widget, and, if you are
going to use it frequently, bind it to a key sequence. The
following binds insert-composed-char
to F5 on my keyboard:
  autoload -Uz insert-composed-char
  zle -N insert-composed-char
  bindkey '\e[15~' insert-composed-char
The two widgets are described in the zshcontrib(1)
manual
page, but here is a brief summary:
insert-composed-char
is followed by two characters that
are a mnemonic for a multibyte character. For example a:
is a with an Umlaut; cH
is the symbol for hearts on a playing
card. Various accented characters, European and related alphabets,
and punctuation and mathematical symbols are available. The
mnemonics are mostly those given by RFC 1345, see
http://www.faqs.org/rfcs/rfc1345.html.
insert-unicode-char
is used to input a Unicode character by
its hexadecimal number. This is the number given in the Unicode
character charts, see for example http://www.unicode.org/charts/.
You need to execute the function, then type the hexadecimal number
(you can omit any leading zeroes), then execute the function again.
Both functions can be used without multibyte mode, provided the locale is correct and the character selected exists in the current character set; however, using UTF-8 massively extends the number of valid characters that can be produced.
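The second widget can be set up in the same way as the earlier example; the key sequence below is an assumption (F6 on many terminals) and may need adjusting for your keyboard:

  autoload -Uz insert-unicode-char
  zle -N insert-unicode-char
  bindkey '\e[17~' insert-unicode-char   # F6 on many terminals

With that in place, you would press F6, type the hexadecimal number (for example 44b), and press F6 again to insert the corresponding character (here U+044B, Cyrillic small letter yeru).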
If you have a recent X Window System installation, you might find the AltGr key helps you input accented Latin characters; for example, on my keyboard pressing AltGr with `;', then e, gives an e with an acute accent.
See also http://www.cl.cam.ac.uk/~mgk25/unicode.html#input
for general information on entering Unicode characters from a keyboard.