For a long time computers had a simple idea of a character: each octet (8-bit byte) of text contained one character. This meant an application could only use 256 characters at once. The first 128 characters (0 to 127) on Unix and similar systems usually corresponded to the ASCII character set, as they still do. So all other possibilities had to be crammed into the remaining 128. This was done by picking the appropriate character set for the use you were making. For example, ISO 8859 specified a set of extensions to ASCII for various alphabets.
This was fine for simple extensions and for close relatives of the Latin alphabet (those with no more than a few dozen extra alphabetic characters), but useless for more complex writing systems. Also, having a different character set for each language is inconvenient: you have to start a new terminal to run a shell with each character set. So the character set had to be extended. To cut a long story short, the world has mostly standardised on a character set called Unicode, related to the international standard ISO 10646. The intention is that this will contain every single character used in all the languages of the world.
This has far too many characters to fit into a single octet. What's more, UNIX utilities such as zsh are so used to dealing with ASCII that removing it would cause no end of trouble. So what happens is this: the 128 ASCII characters are kept exactly the same (and they're the same as the first 128 characters of Unicode), but the remaining 128 octet values are used to build up any other Unicode character as a sequence of several octets. The shell doesn't need to interpret these directly; it just needs to ask the system library how many octets form the next character, and whether there's a valid character there at all. (It can also ask the system what width the character takes up on the screen, so that characters no longer need to be exactly one position wide.)
The way this is done is called UTF-8. Multibyte encodings of other character sets exist (you might encounter them for Asian character sets); zsh will be able to use any such encoding as long as it contains ASCII as a single-octet subset and the system can provide information about other characters. However, in the case of Unicode, UTF-8 is the only one you are likely to encounter that is useful in zsh.
(In case you're confused: Unicode is the character set, while UTF-8 is an encoding of it. You might hear about other encodings, such as UCS-2 and UCS-4 which are basically the character's index in the character set as a two-octet or four-octet integer. You might see files encoded this way, for example on Windows, but the shell can't deal directly with text in those formats.)
Until version 4.3, zsh didn't handle multibyte input properly at all. Each octet in a multibyte character would look to the shell like a separate character. If your terminal handled the character set, characters might appear correct on screen, but trying to edit them would cause all sorts of odd effects. (It was possible to edit in zsh using single-byte extensions of ASCII such as the ISO 8859 family, however.)
From version 4.3.4 (stable versions starting from 5.0), multibyte input is handled in the line editor if zsh has been compiled with the appropriate definitions, and is automatically activated. This is indicated by the option MULTIBYTE, which is set by default on shells that support multibyte mode. Hence you can test this with a standard option test: `[[ -o multibyte ]]'.
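For instance, a startup file might guard locale-dependent settings with that test; a minimal sketch:

  # only set up UTF-8 related behaviour if this shell supports it
  if [[ -o multibyte ]]; then
    print "multibyte (e.g. UTF-8) input is supported"
  fi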
The MULTIBYTE option affects the entire shell: parameter expansion, pattern matching and so on count a valid multibyte character as a single character, however many octets it occupies. You can unset the option locally in a function to revert to single-byte operation.
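As an illustration (a sketch, assuming a UTF-8 locale): the character é is one character but two octets in UTF-8, so the length reported by ${#} changes when the option is unset:

  utf8-length-demo() {
    local s=$'\u00e9'                   # "é": one character, two octets in UTF-8
    print "with multibyte:    ${#s}"    # 1: counted as a single character
    setopt localoptions nomultibyte     # revert to single-byte operation locally
    print "without multibyte: ${#s}"    # 2: counted octet by octet
  }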
As multibyte characters are nowadays handled as standard by most utilities, since version 5.1 the MULTIBYTE option has also been turned on when emulating other shells.
The other option that affects multibyte support is COMBINING_CHARS, new in version 4.3.9. When this is set, any zero-length punctuation characters that follow an alphanumeric character (the base character) are assumed to be modifications (accents etc.) to the base character and to be displayed within the same screen area as the base character. As not all terminals handle this, even if they correctly display the base multibyte character, this option is not on by default. Recent versions of the KDE and GNOME terminal emulators konsole and gnome-terminal, as well as rxvt-unicode and the Unicode version of xterm (xterm -u8, or the front-end uxterm), are known to handle combining characters.
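A quick way to see whether your terminal copes (a sketch, assuming a UTF-8 locale): print a base character followed by U+0301 COMBINING ACUTE ACCENT and check whether the two code points share a single screen position:

  # "e" followed by U+0301; a capable terminal shows one accented character
  print $'e\u0301'
  # tell the line editor to assume the terminal handles such sequences
  setopt combiningchars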
The COMBINING_CHARS option only affects output; combining characters may always be input, but when the option is off they will be displayed specially. By default this is as a code point (the index of the character in the character set) between angle brackets, usually in inverse video. Highlighting of such special characters can be modified using the new array parameter zle_highlight.
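For example, the following (a sketch; the special context and the attribute syntax are described in the zshzle(1) manual page) shows such characters in bold red instead of inverse video:

  # display special characters, e.g. lone combining code points, in bold red
  zle_highlight=(special:fg=red,bold)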
Once you have a version of zsh with multibyte support, you need to ensure the environment is correct. We'll assume you're using UTF-8. Many modern systems may come set up correctly already. Try one of the editing widgets described in the next section to see.
There are basically three components.
First, the locale, selected by the environment variable LANG (there are others, but this is the one to start with). If you have a recent operating system, very likely it is already set appropriately. Otherwise, you need to find a locale whose name contains UTF-8. This will be a variant on your usual locale, which typically indicates the language and country; for example, mine is en_GB.UTF-8. Luckily, zsh can complete locale names, so if you have the new completion system loaded you can type `export LANG=' and attempt to complete a suitable locale; see also the sketch after this list. It's the locale that tells the shell to expect the right form of multibyte input. (However, there's no guarantee that the shell is actually going to get this input: for example, if you edit file names that have been created using a different character set it won't work properly.)
Second, the terminal emulator. Recent versions of the common graphical emulators, including konsole and gnome-terminal, are likely to have extensive support for localization and may work correctly as soon as they know the locale. You can enable UTF-8 support for xterm in its application defaults file. The following are the relevant resources; you don't actually need all of them, as described below. If you use a ~/.Xdefaults or ~/.Xresources file for setting resources, prefix all the lines with xterm:
  *wideChars: true
  *locale: true
  *utf8: 1
  *vt100Graphics: true

This turns on support for wide characters (this is enabled by the utf8 resource, too); enables conversions to UTF-8 from other locales (this is the key resource and actually overrides utf8); turns on UTF-8 mode (this resource is mostly used to force use of UTF-8 characters if your locale system isn't up to it); and allows certain graphic characters to work even with UTF-8 enabled. (Thanks to Phil Pennock for suggestions.)
Third, the font. If you select the font for your terminal yourself, make sure its encoding is iso10646-1 (and not, for example, iso8859-1). Not all characters will be available in any font, and some fonts may have a more restricted range of Unicode characters than others.
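As a sketch of the first component (the locale name below is only an example; pick one that your system actually reports as available):

  # list the UTF-8 locales installed on the system
  locale -a | grep -i utf
  # select a UTF-8 variant of your usual locale for this session
  export LANG=en_GB.UTF-8
  # confirm the settings the shell will actually use
  locale

For the third component, an X resource along the following lines (the font name is just an example of a bitmap font with an iso10646-1 encoding) selects a Unicode-encoded font for xterm:

  xterm*font: -misc-fixed-medium-r-normal--18-*-*-*-*-*-iso10646-1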
As mentioned in the previous section, bindkey -m now outputs a warning message telling you that multibyte input from the terminal is likely not to work. (See 3.5 if you don't know what this feature does.) If you don't need to input any characters that require a multibyte encoding, however, you can still use the meta bindings and can ignore the warning message. Use `bindkey -m 2>/dev/null' to suppress it.
You might also note that the latest version of the Cygwin environment for Windows supports UTF-8. In previous versions, zsh was able to compile with the MULTIBYTE option enabled, but the system didn't provide full support for it.
Two functions are provided with zsh that help you input characters.
As with all editing widgets implemented by functions, you need to
mark the function for autoload, create the widget, and, if you are
going to use it frequently, bind it to a key sequence. The
following binds insert-composed-char
to F5 on my keyboard:
  autoload -Uz insert-composed-char
  zle -N insert-composed-char
  bindkey '\e[15~' insert-composed-char
The two widgets are described in the zshcontrib(1)
manual
page, but here is a brief summary:
insert-composed-char
is followed by two characters that
are a mnemonic for a multibyte character. For example a:
is a with an Umlaut; cH
is the symbol for hearts on a playing
card. Various accented characters, European and related alphabets,
and punctuation and mathematical symbols are available. The
mnemonics are mostly those given by RFC 1345, see
http://www.faqs.org/rfcs/rfc1345.html.
insert-unicode-char
is used to input a Unicode character by
its hexadecimal number. This is the number given in the Unicode
character charts, see for example http://www.unicode.org/charts/.
You need to execute the function, then type the hexadecimal number
(you can omit any leading zeroes), then execute the function again.
Both functions can be used without multibyte mode, provided the locale is correct and the character selected exists in the current character set; however, using UTF-8 massively extends the number of valid characters that can be produced.
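The second widget can be set up in the same way as the earlier example; the key sequence below is an assumption (F6 on many terminals) and may need adjusting for your keyboard:

  autoload -Uz insert-unicode-char
  zle -N insert-unicode-char
  bindkey '\e[17~' insert-unicode-char   # F6 on many terminals

With that in place, you would press F6, type the hexadecimal number (for example 44b), and press F6 again to insert the corresponding character (here U+044B, Cyrillic small letter yeru).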
If you have a recent X Window System installation, you might find the AltGr key helps you input accented Latin characters; for example, on my keyboard pressing AltGr with `;', then e, gives an e with an acute accent.
See also http://www.cl.cam.ac.uk/~mgk25/unicode.html#input
for general information on entering Unicode characters from a keyboard.