
A survey of Unicode compression




This paper is also available as Unicode Technical Note #14.



Doug Ewell
January 30, 2004

Index
Introduction
Legacy character sets
UTF-16 and UTF-8
SCSU
BOCU-1
Other compression formats
General-purpose compression
On two-layer compression and “the original text”
Compression through normalization
Conclusions
Appendix: a sample SCSU encoder
Notes
Acknowledgements
Version history
References

Introduction

The Unicode (ISO/IEC 10646) coded character set is the largest of its kind. [1] Almost a million code positions are available in Unicode for formal character encoding, with more than 137,000 additional code positions reserved for private-use characters. This is quite a change from the 128 or 256 characters available in 8-bit “legacy” code pages, or even the thousands available in East Asian double-byte character sets (DBCS).

All that space comes at a price, however. To encode all the world’s scripts in a single character set, Unicode must use more than one or even two bytes for some characters. You never used to hear complaints about the “bloated” size of ASCII or Latin-1 text, but with Unicode and its huge capacity, that is a common concern today. Thus many users and companies have become interested in “Unicode compression.”

This paper will discuss compression of Unicode text for storage and interchange purposes, from the standpoint of “compact encodings” and general-purpose compression techniques. Two Unicode transfer encoding syntaxes that were specifically designed for compressing Unicode text will be examined. The motivation behind Unicode text compression will also be discussed. Finally, some suggestions will be made to improve the usefulness of the Unicode compression formats.

Legacy character sets

Before Unicode, every script that needed to be represented in a computer had its own character set. (In many cases, there were several to choose from!) This meant that the character set used in every text file had to be identified to guarantee proper interpretation. Of course, in practice most text files weren’t tagged by character set; instead, the user was expected to know which encoding was used, or be able to figure it out, or the software was supposed to figure it out. This sort of chaos is one of the main reasons Unicode was developed.

These single-script legacy character sets had one advantage, though. Because of their limited scope, they were often quite compact. The entire Greek alphabet could fit into a 128-character block (using ISO 8859-7), and so could the entire Cyrillic alphabet (using KOI8-R) or Thai alphabet (using TIS-620). Most documents didn’t contain a mixture of Greek and Russian and Thai text (and most still don’t), so characters in alphabetic scripts came to be thought of as something that could be represented in 8 bits.

In the case of Chinese, Japanese, and Korean (“CJK”) text, where a typical document might contain thousands of different ideographic Han characters, there never was any expectation that 8 bits per character would suffice. The legacy double-byte character sets designed for CJK text used a single byte for some characters (ASCII and halfwidth katakana) and two for others. DBCS encodings are trickier to handle than fixed-length encodings—programmers must keep track of lead and trail bytes—but at least these character sets represented CJK text in no more than 16 bits, as compactly as could be expected.

An important point to keep in mind as we explore the concept of Unicode compression is that in many circumstances, the goal is not to achieve the greatest possible compression, but to remove the “Unicode penalty” as compared to legacy character sets.

UTF-16 and UTF-8

Until version 3.1 (March 2001), Unicode was strictly defined in terms of “16-bit quantities.” The original 16-bit encoding form, later expanded to handle surrogates and named UTF-16, was considered the standard representation of Unicode. The concern about Unicode doubling the size of text stems directly from this encoding form, and to this day, “Unicode” still means “16 bits” to many people.

UTF-16 is not suitable for all applications, however. One axiom in the software industry is that “legacy” systems always survive longer than most people expect. As a new technology in the early 1990s, Unicode faced an uphill battle to coexist with established, ASCII-based operating systems and text-processing software. Not only did UTF-16 use 16 bits (often thought of as “2 bytes”) for all characters, but any byte value could appear in the encoded text. This caused problems for systems that treat certain bytes specially:

A character like U+5C3A (which exists in all four Microsoft CJK code pages), encoded in little-endian UTF-16 and fed to an ASCII-based, Unicode-unaware operating system, would look like “:\” (colon, backslash). Imagine the fun if this character were part of a file name.

To solve these problems and allow the adoption of Unicode in these situations, another encoding form, UTF-8, was devised. UTF-8 encodes all ASCII characters (including the “special” ones above) using the same bytes as in ASCII, and does not use those bytes to encode anything else. [4] Today UTF-8 is by far the most common Unicode text format.

One of the design goals for UTF-8 was that it “should not be extravagant in terms of number of bytes used for encoding.” A Unicode character encoded in UTF-8 uses between one and four bytes, as shown below:

Range of code points       Bytes
U+0000 – U+007F            1
U+0080 – U+07FF            2
U+0800 – U+FFFF            3
U+10000 – U+10FFFF         4
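
To make the table concrete, here is a minimal sketch in C of how a single code point might be packed into UTF-8 bytes. The function name utf8_encode is just for illustration, and the sketch assumes its input is a valid scalar value (no surrogates, nothing above U+10FFFF); a production encoder would also have to validate its input.

#include <stdio.h>

/* Encode one Unicode scalar value (assumed valid: not a surrogate,
   not above U+10FFFF) as UTF-8.  Returns the number of bytes written. */
static int utf8_encode(unsigned long cp, unsigned char out[4])
{
    if (cp <= 0x7F) {                      /* 1 byte:  0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    } else if (cp <= 0x7FF) {              /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    } else if (cp <= 0xFFFF) {             /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    } else {                               /* 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
}

int main(void)
{
    unsigned long samples[] = { 0x0041, 0x00E9, 0x5C3A, 0x10000 };
    unsigned char buf[4];
    for (int i = 0; i < 4; i++) {
        int n = utf8_encode(samples[i], buf);
        printf("U+%04lX ->", samples[i]);
        for (int j = 0; j < n; j++) printf(" %02X", buf[j]);
        printf("\n");
    }
    return 0;
}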

Because of the 1-byte representation of ASCII, many English-speaking users have come to regard UTF-8 as a compression format. This notion is puzzling to speakers of French and other Latin-script languages that use many non-ASCII characters, and ridiculous to users of almost every other script on Earth, for which UTF-8 uses at least one byte per character more than legacy encodings.

Although UTF-8 is indeed “not extravagant” in size, there is a big difference between avoiding wastefulness and achieving compression. The primary reason to use UTF-8 should be to maintain ASCII transparency, not to achieve compression.

(Incidentally, another common myth about UTF-8 is that its ASCII transparency is the result of a bias toward the English language. As mentioned earlier, the real motivation was transparent handling of characters considered “special” in existing operating systems and applications, not just control characters. In other words, the bias is toward syntax characters in markup languages, program source code, file systems, and command-line user interfaces.)

The official Unicode position on compression is that it constitutes a “higher-level protocol” outside the scope of the Unicode Standard. In time, two such protocols defining different Unicode compression formats emerged: SCSU and BOCU-1.

SCSU

The Standard Compression Scheme for Unicode (SCSU) is defined in Unicode Technical Standard #6 and is based on a technique originally developed by Reuters. The basic premise of SCSU is to define dynamically positioned windows into the Unicode code space, so that characters belonging to small scripts (such as the Greek alphabet or Indic abugidas) can be encoded in a single byte, representing an index into the active window. These windows are preset to blocks expected to be in common use (e.g. Cyrillic), so the encoder doesn’t have to define them in these cases. There are also static windows that cannot be adjusted.

For large scripts, including Han, SCSU features the ability to switch between this single-byte mode and what it calls “Unicode mode,” which is actually big-endian UTF-16 (UTF-16BE).

Instruction bytes called “tags” are inserted into the text to define and switch between windows and to change modes. There are also “quote” tags used to switch to another window for just the next character, and then return to the current window without requiring another tag byte. This is useful for encoding single characters that lie outside the current window, or characters whose value conflicts with another tag byte. Because the tags used in single-byte mode fall within the ASCII control range, and the tags used in Unicode mode correspond to the high 8 bits of private-use characters, the overhead of the quote tags to resolve conflicts is often not needed.

“Quote” tag        Purpose
SQ0                Encode an ASCII control character while in single-byte mode.
SQ1 through SQ7    Switch to window 1 through 7 to encode just the next character.
SQU                Switch to Unicode mode to encode just the next character.
UQU                Encode certain BMP private-use characters while in Unicode mode.

The net result of using SCSU is that text is encoded essentially as compactly as it would be in a legacy character set: 8 bits for each alphabetic character, 16 bits for each Han character. The tags add a relatively small amount of overhead. The only scripts not compressed well by SCSU are certain language-specific repertoires, like the subset of Canadian Syllabics used in Inuktitut and the subset of Latin used in Vietnamese, which are spread across several 128-character blocks and require frequent window switching. Even alphabetic scripts in the supplementary code space (beyond U+FFFF) can be compressed efficiently, by using the SDX tag to define an “extended” window. SCSU is especially effective at compressing short strings, since there is very little initial overhead.

Compared to a general-purpose decompressor, an SCSU decoder is fairly easy to program. Encoders are quite another story—one of the most common criticisms of SCSU is that it is difficult to write a decent encoder for it—but of course a general-purpose compression algorithm such as LZ77 is not simple either. An SCSU encoder can be trivially simple, very ambitious and complicated, or just about anywhere in between. The “difficulty” in writing an SCSU encoder is not in implementing the variety of SCSU tools (modes, window management, etc.), but in deciding how to use these tools to achieve good compression, and how much effort will yield “good enough” compression.

UTS #6 recommends that encoders should be “XML Suitable,” which means an initial sequence of Latin-1 text (U+0020 through U+00FF plus certain control characters) should be encoded as single bytes without tags. This recommendation disqualifies the trivial “bad encoder,” one that starts with a switch-to-Unicode-mode tag and encodes the entire input text as big-endian UTF-16. [5] It’s unlikely, however, that many professional-grade SCSU encoders will be affected by this recommendation.

SCSU in single-byte mode is ASCII-compatible, and if no window-switching or mode-switching tags are used, it is compatible with Latin-1 as well. The only exception to this Latin-1 transparency is that all ASCII control characters other than 00 (null), 09 (tab), 0A (LF), and 0D (CR) must be preceded by a quote tag. The most annoying impediment to ASCII control-code transparency is that 0C (form feed) must be quoted even though it isn’t used as an SCSU tag. Form feeds appear in a wide variety of text files, including nearly all Internet RFCs.

Here is an example of a short sentence, using the Latin and Cyrillic scripts, encoded in SCSU:

“ M o s c o w ”   i s   М о с к в а .
05 1C 4D 6F 73 63 6F 77 05 1D 20 69 73 20 12 9C BE C1 BA B2 B0 2E

In this example, the “curly” double quotation marks U+201C and U+201D are encoded using an SQ4 tag (byte 05), signaling the use of static window 4, followed by a byte indicating the offset of the desired character within the window. SCSU compressors usually encode non-ASCII punctuation characters like the curly quotes by quoting from static windows instead of switching to dynamic windows, because such characters usually appear in isolation and this approach is more efficient than switching windows. For the Cyrillic characters, the encoder used an SC2 tag (byte 12) to switch to dynamic window 2, which is pre-defined to the Cyrillic block starting at U+0400.

This example shows how a string of 19 Unicode characters can be compressed to 22 bytes; the same text in UTF-8 would require 29 bytes. Note that this sample string is only intended as a demonstration of how SCSU works, not as a benchmark. Actual compression performance will vary, depending on the input text.
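
To illustrate how little machinery this particular byte stream requires, here is a minimal sketch in C of a decoder that handles only the features used in the example: single-byte mode, the SQ0 through SQ7 quote tags, and the SC0 through SC7 window-change tags, with the default window positions from UTS #6. It ignores Unicode mode, window definition, supplementary-plane windows, and error handling, so it is a demonstration rather than a conformant SCSU decoder.

#include <stdio.h>

/* Default dynamic-window positions and fixed static-window positions
   (BMP only; the extended-window mechanism is ignored). */
static const unsigned long dyn_win[8] = {
    0x0080, 0x00C0, 0x0400, 0x0600, 0x0900, 0x3040, 0x30A0, 0xFF00
};
static const unsigned long sta_win[8] = {
    0x0000, 0x0080, 0x0100, 0x0300, 0x2000, 0x2080, 0x2100, 0x3000
};

static void decode_single_byte_mode(const unsigned char *in, int len)
{
    int active = 0;                       /* dynamic window 0 is active by default */
    for (int i = 0; i < len; i++) {
        unsigned char b = in[i];
        if (b >= 0x01 && b <= 0x08) {     /* SQ0..SQ7: quote one character */
            unsigned char t = in[++i];
            int w = b - 0x01;
            unsigned long cp = (t < 0x80) ? sta_win[w] + t
                                          : dyn_win[w] + (t - 0x80);
            printf("U+%04lX ", cp);
        } else if (b >= 0x10 && b <= 0x17) {
            active = b - 0x10;            /* SC0..SC7: change active window */
        } else if (b >= 0x80) {
            printf("U+%04lX ", dyn_win[active] + (b - 0x80));
        } else {
            printf("U+%04lX ", (unsigned long)b);  /* ASCII and allowed controls */
        }
    }
    printf("\n");
}

int main(void)
{
    /* The 22-byte example from the text: "Moscow" is Moskva (in Cyrillic). */
    static const unsigned char scsu[] = {
        0x05,0x1C,0x4D,0x6F,0x73,0x63,0x6F,0x77,0x05,0x1D,0x20,0x69,
        0x73,0x20,0x12,0x9C,0xBE,0xC1,0xBA,0xB2,0xB0,0x2E
    };
    decode_single_byte_mode(scsu, (int)(sizeof scsu));
    return 0;
}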

Text encoded in SCSU can be prefixed by a signature, U+FEFF, also known as the byte order mark or BOM. Like UTF-8, SCSU is a byte-oriented encoding [6], so there is no need to indicate the byte order, but a signature identifying the following text as being encoded in SCSU may be useful in some situations. UTS #6 suggests the sequence 0E FE FF, which uses the SQU tag to introduce an isolated Unicode-mode character without changing the state of the encoder or decoder. This means the signature can be automatically added to or stripped from the beginning of a text stream without affecting the surrounding text, just as in UTF-8. However, prefixing a signature to an SCSU-encoded file means the file is no longer ASCII-transparent, which interferes with XML suitability.

SCSU is not perfect; it has some disadvantages that limit its attractiveness as a compression format. First, as mentioned above, there is the perception that a “good” SCSU encoder is difficult to write. In fact, a “good enough” encoder that takes advantage of some of the features of SCSU can still achieve good compression, and is less complex than many algorithms that have found their way into commonly used software. Single-character look-ahead, used in the sample encoders described at the end of this section, is often sufficient to achieve good compression. Apprehension over the effort required to create an “optimal” encoder should not be a major hindrance to the use of SCSU.

Another concern is that SCSU uses bytes in the ASCII control range, not only for tags in single-byte mode but as part of ordinary characters in Unicode mode. This makes SCSU unsuitable for protocols such as MIME (Multipurpose Internet Mail Extensions) that interpret these control bytes without decoding them. BOCU-1, described in the next section, solves this particular problem by avoiding the most important ASCII control bytes.

Still another criticism of SCSU is that the compressed encoding is non-deterministic—that is, there are multiple (correct) ways to compress the same input text. UTS #6 contains four examples to demonstrate how text might be encoded in SCSU. There are short sample texts in German and Russian, a longer sample in Japanese, and an artificial sample contrived to use all of the features of SCSU. The first two are straightforward, but the other two—particularly the Japanese example—could reasonably be compressed in many different ways, including some more efficient than the sample! Even if the same strategy were used, different encoders might choose different windows, resulting in trivial differences between encoded samples.

Finally, SCSU is criticized for not being supported in many browsers and other Unicode-aware text applications, which is also true but perhaps a bit self-fulfilling. Certainly a reputation that “SCSU isn’t widely supported” is unlikely to encourage developers and companies to add support for SCSU. Compressing and expanding SCSU text is not a black art; there are public domain implementations, most notably the sample Java implementation developed by Asmus Freytag in conjunction with UTS #6 and a decoder written in C by Roman Czyborra. SCSU is also supported in the ICU (International Components for Unicode) open-source library for developers, and in SC UniPad, a Unicode text editor for Windows. (A description of the SCSU compression algorithm used in UniPad is presented in the Appendix.)

BOCU-1

The Binary Ordered Compression for Unicode (BOCU) concept was developed in 2001 by Mark Davis and Markus Scherer for the ICU project. A specific application of the general BOCU algorithm, called BOCU-1, is described in Unicode Technical Note #6 (UTN #6, not to be confused with UTS #6, which defines SCSU). [7] Although BOCU-1 is even less widely used than SCSU so far, and supported by even fewer applications, it introduces some interesting concepts and brings a completely different approach to the problem of Unicode compression.

The main premise of BOCU-1 is to encode each Unicode character as the difference from the previous character, and to represent small differences in fewer bytes than large differences. By encoding differences, BOCU-1 achieves the same compression for all small alphabetic scripts, regardless of the block they reside in. BOCU-1 extends this basic concept in several ways:

A major design goal of BOCU-1, as the name implies, is to maintain binary ordering of the encoded text. That means that if a list of names is sorted in strict binary code point order, and each name in that list is encoded in BOCU-1, the encoded list will also be in binary code point order. Although binary order is often not culturally valid, it can be critically important for database applications and such, where “sorted” items need to stay sorted. BOCU-1 is deterministic; that is, all implementations of BOCU-1 encode the same text in exactly the same way. By contrast, text compressed with SCSU may use different sequences of bytes, and may not remain in code point order.

An additional design goal was to avoid the indiscriminate use of bytes corresponding to ASCII control characters, which might confuse certain text processes if they appeared at unexpected times (as in SCSU). In BOCU-1, all ASCII control characters, as well as the space character, are encoded as the same byte values as in ASCII, and 12 of the control bytes do not appear anywhere else within a BOCU-1 text stream. The creators of BOCU-1 use the term “MIME-friendly” to describe this property.

Like SCSU, BOCU-1 requires a very small amount of initial overhead—practically none unless a signature is used. Additionally, BOCU-1 maintains a very small amount of “state” from one character to the next—a single integer representing the adjusted base value—compared to SCSU, which must keep track of the current mode (single-byte vs. Unicode), the current window, and the position of all eight dynamic windows. The intent was to make BOCU-1 simpler to decode and (especially) encode than SCSU. (As we will see, this goal has been only partially met.)
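
The differencing idea can be illustrated with a toy sketch like the one below, which encodes each code point as a signed delta from the previous one, spending a single byte on small deltas and a three-byte escape on large ones. This is only the principle, not BOCU-1 itself; the real format uses several lead-byte ranges, adjusts its base value in more subtle ways, and avoids the reserved control bytes.

#include <stdio.h>

/* Toy difference encoder (NOT BOCU-1): deltas in [-64, +63] become a
   single byte 0x40 + delta; anything else becomes 0xFF followed by the
   delta as a big-endian 16-bit two's-complement value (BMP-sized deltas
   only).  Characters within one small-alphabet block therefore cost one
   byte each after the first. */
static int toy_diff_encode(const unsigned long *cp, int n, unsigned char *out)
{
    long prev = 0x0040;                    /* arbitrary starting base */
    int k = 0;
    for (int i = 0; i < n; i++) {
        long d = (long)cp[i] - prev;
        if (d >= -64 && d <= 63) {
            out[k++] = (unsigned char)(0x40 + d);
        } else {
            unsigned short u = (unsigned short)d;
            out[k++] = 0xFF;
            out[k++] = (unsigned char)(u >> 8);
            out[k++] = (unsigned char)(u & 0xFF);
        }
        prev = (long)cp[i];
    }
    return k;
}

int main(void)
{
    /* The Greek word "kosmos": one large jump, then one byte per letter. */
    unsigned long text[] = { 0x03BA, 0x03CC, 0x03C3, 0x03BC, 0x03BF, 0x03C2 };
    unsigned char buf[32];
    int n = toy_diff_encode(text, 6, buf);
    for (int i = 0; i < n; i++) printf("%02X ", buf[i]);
    printf("(%d bytes for 6 characters)\n", n);
    return 0;
}

Run on this short Greek word, the sketch spends two extra bytes on the initial jump into the Greek block and one byte per character thereafter; that behavior, generalized and made deterministic, is what BOCU-1 provides.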

Here’s the mixed Latin/Cyrillic sentence we saw before, this time encoded in BOCU-1:

“ M o s c o w ”   i s   М о с к в а .
F1 56 2E A0 BF C3 B3 BF C7 F1 57 20 2E BC C3 20 D3 D0 8E 91 8A 82 80 4B FA

In BOCU-1, each time the current character is in a different block from the one before, the resultant jump must be encoded in at least two bytes. Thus not only does each curly quote require a 2-byte encoding, so does the ASCII character immediately following. (The space character 20 is an exception, as described above.) In this example, our string of 19 Unicode characters is compressed to 25 bytes. Again, this is just a demonstration of how BOCU-1 works, not a true measure of its efficiency.

Because each character can influence the encoding of the following character, the use of the U+FEFF signature is not as straightforward as in SCSU. An initial U+FEFF can be encoded in BOCU-1 with the bytes FB EE 28, representing the distance from the starting base value of U+0040. But now the base value has been adjusted to U+FEC0, so the next character (U+201C in this example) must be encoded as the difference from that new base instead of the starting base. As a result, U+201C must be encoded with the three bytes 24 40 BA instead of the two bytes F1 56 shown in the example. This means the signature cannot be added or stripped blindly, as it can in SCSU.

This is especially important because, as the example shows, characters in the ASCII range are not encoded with the same bytes as in ASCII. For instance, the letter s (U+0073) is never encoded with the byte 73; even if preceded by another ASCII character it must be encoded as C3. As the Technical Note states, “the charset must be specified with a signature byte sequence or in a higher-level protocol.” Not all applications can tolerate this; an XML parser, for instance, would have to detect the BOCU-1 encoding or be told about it via a higher-level protocol, whereas other encodings could be identified by an ASCII-compatible declaration. There could be no such thing as an “XML Suitable” BOCU-1 encoder, in the sense that that term is defined for SCSU. It is not yet clear to what extent the lack of ASCII readability may hinder the acceptance of BOCU-1.

The simplicity of encoding and decoding BOCU-1 is also compromised because of the discontinuous range of bytes used for encoding. Because 20 of the 32 ASCII control bytes are also used to encode other characters, and the other 12 are not, BOCU-1 encoders and decoders must use a table-driven approach to determine the repertoire used for the encoded form. In some ways, this is a greater nuisance for decoders than for encoders (it complicates error checking, for example).

Perhaps the greatest drawback of BOCU-1, though, is that it is not formally specified anywhere at present, except in the program code accompanying UTN #6. In the words of the Technical Note:

The sample C code serves as the full specification of BOCU-1. Every conformant encoder and decoder must generate equivalent output and detect any illegal input code points and illegal input byte sequences.

Because there is no formal specification—as there is for UTF-16, UTF-8, SCSU, and virtually every other Unicode encoding and general-purpose compression algorithm ever devised—programmers can only create new BOCU-1 encoders and decoders by reverse-engineering the sample code. This may be an effective way for employers to determine an applicant’s programming skill, but it is not recommended engineering practice for an interoperable standard. To become widely adopted, BOCU-1 needs to be formally specified in its Technical Note, in a human language, as UTF-8 and SCSU are. (This is currently planned as an update to UTN #6.)

BOCU-1 is supported in the ICU library.

Other compression formats

SCSU and BOCU-1 are not the only possible approaches to Unicode compression. It’s easy to imagine a different character encoding scheme or transfer encoding syntax aimed at producing the smallest possible text. For example, there is a technique in the Sybase Unilib library (called simply “Compression”) in which runs of six or more Unicode characters within the same 256-character block are compressed by encoding only the least significant 8 bits of each character. The run is prefixed by the most significant 8 bits of the block and the number of characters in the run.

As described in the online Unilib Reference Manual, this approach “works particularly well for data encoded mostly in ISO Latin-1 (such as Windows ANSI and ISO 8859-1) or ASCII proper.” Naturally, not all Unicode text in the real world can benefit from this technique; words of fewer than six letters in Greek, Arabic, or Hindi cannot be compressed, because the runs are broken by ASCII spaces. (The Reference Manual also describes an implementation of the Reuters Compression Scheme for Unicode, the predecessor of SCSU.)

Between the widespread use of UTF-8 and the presence of SCSU and BOCU-1, it seems unlikely that any new Unicode compression formats will gain enough popularity to justify software support. There is, however, a way to compress Unicode text much more efficiently than any compact encoding can, and that is by using a general-purpose compression algorithm.

General-purpose compression

The concept of general-purpose (GP) data compression is familiar to most computer users, thanks to commercial utilities like PKZIP, WinZip, and StuffIt. GP compression algorithms are intended to provide good performance on all kinds of data, and while the effectiveness of some simple compression algorithms (such as RLE) can vary dramatically depending on the input, most commercial-grade utilities have an arsenal of algorithms to choose from, to ensure decent compression for any type of data.

One fundamental difference between GP compression and the Unicode compression formats we have looked at is that the output of GP compression is bit-oriented, rather than byte-oriented. An arbitrary bit combination, not necessarily along octet boundaries, may represent one or more input characters. Bit-oriented compression means that the bytes that are ultimately formed by serializing the compressed data to a file are arbitrary and may look totally random; they cannot be processed in any meaningful way except to decompress them. All bytes are possible, including those that are carefully avoided by Unicode compression formats for purposes like synchronization and interoperability. Data compressed by a GP technique must be transferred through a channel that does not attempt to interpret the bytes; for example, FTP protocols must be set to binary mode, as opposed to an ASCII mode where 0D and 0A are interpreted as CR and LF, or where 20s might be stripped as “trailing spaces.”

GP compression may be performed explicitly by the user, using a tool such as WinZip, or it may be performed implicitly by a piece of software. For example, a “closed” E-mail system such as Lotus Notes can be configured to compress an attachment automatically before sending it, and decompress it automatically when the message is delivered, to reduce bandwidth. The user is often unaware that data compression is being used, and the actual efficiency of the compression may not be critical.

One very simple type of GP compression is run-length encoding, or RLE. This technique compresses sequences, or runs, of identical data items by encoding one copy of the item together with the number of repetitions. For example, a string of 20 X’s:

XXXXXXXXXXXXXXXXXXXX

might be encoded as:

{ 20 } { X }

If the algorithm allows run-length-encoded data to be mixed with unencoded data, then the encoded sequence would have to be marked, like this:

{ RLE marker } { 20 } { X }

Then if the bit combination used for the RLE marker ever appeared as “real” data, it would have to be “escaped” or encoded specially to avoid being interpreted as a marker.
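
A minimal sketch of such a scheme appears below; the choice of marker byte, the escape convention, and the minimum run length worth encoding are all arbitrary and exist only to illustrate the idea.

#include <stdio.h>
#include <string.h>

#define RLE_MARKER 0x1B   /* arbitrary marker byte for this sketch */

/* Runs of 4 or more identical bytes become: MARKER, count, byte.
   A literal occurrence of the marker byte is escaped as: MARKER, 0. */
static int rle_encode(const unsigned char *in, int len, unsigned char *out)
{
    int k = 0;
    for (int i = 0; i < len; ) {
        int run = 1;
        while (i + run < len && in[i + run] == in[i] && run < 255) run++;
        if (run >= 4) {
            out[k++] = RLE_MARKER;
            out[k++] = (unsigned char)run;
            out[k++] = in[i];
        } else if (in[i] == RLE_MARKER) {
            out[k++] = RLE_MARKER;          /* escape a literal marker */
            out[k++] = 0;
            run = 1;
        } else {
            for (int j = 0; j < run; j++) out[k++] = in[i];
        }
        i += run;
    }
    return k;
}

int main(void)
{
    const unsigned char *text = (const unsigned char *)"XXXXXXXXXXXXXXXXXXXX done";
    unsigned char out[64];
    int len = (int)strlen((const char *)text);
    int n = rle_encode(text, len, out);
    printf("%d input bytes -> %d output bytes\n", len, n);
    return 0;
}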

RLE is especially useful for bitmapped graphics, sound files, and similar data where there is a high correlation between adjacent data points. It is not especially useful for text because natural-language text rarely contains runs of more than two of the same character (except for spaces, hyphens, and the like). UTS #6 describes a “possible private extension” that adds an RLE mechanism to SCSU, but nobody uses it because of its limited value and, well, because nobody uses it.

Another, somewhat more advanced GP compression algorithm is called Huffman encoding. The basic principle of Huffman encoding is to encode frequently occurring data items in fewer bits and less frequent data items in more bits. This characteristic makes the Huffman scheme effective for compressing natural-language text, where a letter like E or A occurs much more frequently than X or Z.

Huffman encoding is associated with an information-theory concept called entropy, which is a measure of the unpredictability of a message, and by extension its compressibility, in terms of the minimum number of bits per data item needed to represent the message. In the case of text, of course, the “data items” are characters. For example, a message consisting of 10,000 mixed A’s and B’s, with equal probability, could be represented in 10,000 bits (1,250 bytes) by assigning 0 to A and 1 to B (or vice versa). [8] The entropy of this message is 1.00, indicating that an average of 1 bit per character is needed to encode the message, at a minimum. Higher numbers indicate less redundancy and less compressibility; one study shows a figure of 4.7 for “US-ASCII symbols in English text.” Huffman codes can always achieve compression within one bit of the entropy, and can be easily constructed by building a binary tree, with a character at each leaf node.

To illustrate, suppose you have the message:

This is a test

There are 14 characters in this message, and 8 different characters. The entropy of this message is 2.84. It is possible, and in fact easy, to construct a Huffman tree such that the message can be encoded in only 40 bits, an average of 2.86 bits per character, which is very close to the theoretical minimum.
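
The entropy figure quoted above can be reproduced with a few lines of code that tally character frequencies and sum -p log2(p) over the distinct characters of the message; the sketch below does exactly that (compile with -lm on most systems).

#include <stdio.h>
#include <string.h>
#include <math.h>

int main(void)
{
    const char *msg = "This is a test";
    int freq[256] = { 0 };
    int len = (int)strlen(msg);

    for (int i = 0; i < len; i++)
        freq[(unsigned char)msg[i]]++;

    double entropy = 0.0;
    int distinct = 0;
    for (int c = 0; c < 256; c++) {
        if (freq[c] == 0) continue;
        double p = (double)freq[c] / len;
        entropy -= p * log2(p);
        distinct++;
    }

    printf("%d characters, %d distinct\n", len, distinct);
    printf("entropy = %.2f bits/character (about %.1f bits for the whole message)\n",
           entropy, entropy * len);
    return 0;
}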

There is some overhead associated with Huffman encoding, because the tree structure used to devise the encoding must usually be stored along with the data. Without it, the string of 10,000 A’s and B’s described earlier (encoded as 0-bits and 1-bits) could just as well be a string of X’s and Y’s. The overhead is dependent on the number of items in the tree, and becomes less significant as the size of the message increases. Because of this, Huffman encoding is not suitable for compressing small text strings, such as names, something that SCSU and BOCU-1 were specifically designed to do well.

Some Huffman encoders, such as the Unix compact utility, employ alternative mechanisms to store the tree. There is also a technique called adaptive Huffman encoding, in which the tree is updated continuously by both encoder and decoder. This eliminates the need to store the tree and results in better compression, especially for large amounts of data, but updating the tree is a computationally expensive process, perhaps to the point of being impractical.

A related compression technique, arithmetic coding, is similar to Huffman encoding in that the degree of compression is based on the relative probability of each input data item. Arithmetic coding converts a sequence of characters to a single floating-point value, of arbitrary precision, within a range that depends on the number and probability of each input character. Because there is no single “correct” value that corresponds to the input sequence, but instead a range of values, the value can be stored in such a way as to use the minimum number of bits.

Each character contributes an arithmetically determined degree of precision to the final value, instead of being encoded in a fixed number of bits. Therefore, a character can be thought of as having been compressed into, say, 2.4 bits, rather than being constrained to 2 or 3 bits as in Huffman encoding. Arithmetic coding can thus achieve even better compression than Huffman encoding—substantially better if the probabilities are lopsided.

A significantly more complex GP compression algorithm, or family of algorithms, is named Lempel-Ziv after its inventors. Lempel-Ziv (LZ) compression is basically an adaptive dictionary technique in which runs of data are replaced with references to earlier occurrences of the same run. The basic LZ77 algorithm is the foundation of the “deflate” technique used in “zip” programs such as PKZIP, WinZip, and gzip. A related algorithm, LZ78, is the basis of the LZW (Lempel-Ziv-Welch) algorithm used by the Unix compress utility; LZW is also an integral part of the widely used GIF image format and became notorious during the 1990s for being protected under a Unisys patent. The struggle and controversy over the Unisys LZW patent and licensing fees illustrates the difficulty of creating a really good GP compression algorithm.

The Burrows-Wheeler block-sorting algorithm, first described in a 1994 research paper, is an innovative preprocessing step in which strings of data are rotated and sorted to improve their compressibility in a subsequent stage, using available algorithms such as Huffman or arithmetic coding. (The major innovation was finding an easy way to undo the sorting operation.) Not only is the Burrows-Wheeler approach highly effective in general, but it also has the interesting property (described in the next section) of compressing Unicode text about equally well regardless of its original encoding format.

LZ and Burrows-Wheeler are very effective at compressing different types of data, including text. However, because LZ algorithms must maintain a dictionary of “earlier occurrences” of the same data, and Burrows-Wheeler requires large input blocks for best results, these algorithms—like Huffman—do not perform well on small text strings, which is precisely where SCSU and BOCU-1 excel. In addition, they are computationally intensive and require large data tables, and thus need significant processing power and memory. Because of this, they are generally not suitable for small-scale platforms like mobile phones and handhelds, where the modest resource requirements of the Unicode compression formats and simpler GP compression schemes are a better fit.

On two-layer compression and “the original text”

Unicode compression formats and GP compression algorithms both work to reduce the size of Unicode text. One might be tempted to combine the two techniques, converting UTF-16 or UTF-8 text to SCSU or BOCU-1 and then condensing the result with a GP compression tool. The effectiveness of such a two-layer approach depends heavily on the GP compression algorithm, because certain GP algorithms are more encoding form sensitive than others.

As an example, consider Huffman encoding. Almost all existing Huffman implementations are designed to encode a sequence of bytes, not a sequence of 16-bit or 32-bit values. An 8-bit Huffman encoder would compress Unicode text by compressing each byte in the encoding form individually, not by compressing each character as one might expect. This means the performance of Huffman encoders is typically much lower for Unicode text than for ASCII, and is sensitive to the Unicode encoding form. For instance, the name of the Polish city Łódź contains only four characters, but six different bytes in either UTF-16 or UTF-8, which degrades Huffman performance by increasing the number of data items to be encoded without a corresponding reduction in entropy. The obvious solution is to build a Huffman encoder that operates directly on 16- or 32-bit Unicode code points, but such an encoder would require significantly more memory and run more slowly than traditional implementations. Failing that, it is clear that text encoded in SCSU or BOCU-1 can be compressed more effectively using the Huffman method than the same text encoded in UTF-16 or UTF-8.
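
The effect is easy to verify; the sketch below counts the distinct byte values in the UTF-8 form of Łódź, with the UTF-8 bytes written out explicitly so that the source file’s own encoding does not matter.

#include <stdio.h>

int main(void)
{
    /* "Lodz" with diacritics, as UTF-8 bytes:
       U+0141 = C5 81, U+00F3 = C3 B3, U+0064 = 64, U+017A = C5 BA */
    const unsigned char utf8[] = { 0xC5,0x81, 0xC3,0xB3, 0x64, 0xC5,0xBA };
    int seen[256] = { 0 };
    int distinct = 0;

    for (int i = 0; i < (int)(sizeof utf8); i++) {
        if (!seen[utf8[i]]) { seen[utf8[i]] = 1; distinct++; }
    }
    printf("4 characters, %d bytes, %d distinct byte values\n",
           (int)(sizeof utf8), distinct);
    return 0;
}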

By contrast, in the Burrows-Wheeler approach, the process of rotating and sorting strings of text that belong to a small alphabet and compressing the resulting “header” information achieves much the same effect as encoding the string in SCSU; the redundancy inherent in the encoding is largely discarded in either case. Because of this, the Burrows-Wheeler algorithm compresses Unicode text with roughly equal success regardless of the original format. There is not much to be gained by converting Unicode text to SCSU or BOCU-1 first if the text is going to be compressed using Burrows-Wheeler.

Even for algorithms like LZ, the argument that the compressibility of Unicode text depends on its encoding form has been questioned. In September 2002, Steven Atkin and Ryan Stansifer presented a paper at the 22nd International Unicode Conference titled “Unicode Compression: Does Size Really Matter?” The paper was updated in November 2002 and is currently available on the World Wide Web.

In their paper, Atkin and Stansifer argue that there is no significant advantage to storing Unicode text in a compression format such as SCSU or BOCU-1, compared to storing “the original text,” if general-purpose compression routines are used. The authors collected and created a total of 25 text files, including literature in many different languages and encodings (such as Anna Karenina in KOI8-R) and some artificial texts (such as a file of 12,000 A’s), and analyzed them using a variety of encodings and GP compression tools. The results were made available on the Web in a supplementary page. Their conclusion, as stated in the abstract to their paper, is that “as far as size is concerned, algorithms designed specifically for Unicode may not be necessary.”

Atkin and Stansifer’s paper specifically addresses the GP-compressibility of large text files encoded in various formats. It does not take up the issue of which format is most compact on its own, and it does not address the compressibility of small text strings like names, for which Unicode compression formats are best suited. The larger the text sample, the greater the benefit provided by a sophisticated GP compression algorithm, and the less significant its overhead.

Using bzip2, a compressor which employs the Burrows-Wheeler algorithm, even absurdly inefficient formats—such as representing each character by its full Unicode name (e.g. LATIN CAPITAL LETTER A WITH CIRCUMFLEX)—could be reduced to almost the same size as more compact formats. Atkin and Stansifer demonstrate that block-sorting compression techniques eliminate most of the redundancy of the encoding format. The supporting data compares the compressibility of each encoding format and declares a “winner,” which is somewhat misleading since the paper attempts to show that all formats are about equally compressible. (In some cases, the “winner” was only 0.01% smaller than another format!)

In contrast, the gzip compression tool, which uses LZ77, generally performed 15% to 25% better on natural-language, small-alphabet text encoded in SCSU or BOCU-1 than on the same text encoded in UTF-16. The authors claim that these differences are not significant, but they can hardly be considered negligible.

A significant distraction in the research of Atkin and Stansifer is that they compare the size and compressibility of the legacy character sets in which many of their text samples were acquired with that achieved by Unicode encoding schemes, transfer encoding syntaxes, and other formats (base64 and HTML entities among them). This has the effect, probably unintended, of portraying legacy character sets as a kind of highly efficient Unicode encoding format. The paper contains numerous references to “the original text”:

The original text file takes up the least space… Furthermore, the original text file is the most compressible format.

In the alphabetic scripts compressing the original text almost always results in a smaller file than first converting to UTF-8 or some other format and then compressing.

In almost all cases, “the original text” is in a legacy encoding such as ISO 8859, KOI8-R, TIS-620, EUC-JP, or ISCII. As explained earlier, legacy character sets are inherently more compact than Unicode encoding schemes and transfer encoding syntaxes because their capacity is much smaller. They are designed to encode all characters of a national or language-dependent alphabet into 8 bits, and cannot encode any others. [9] The compressibility of legacy encodings may be useful as a benchmark against which the compressibility of SCSU or other Unicode encoding formats can be judged, but it should not be seen as a justification for the continued use of legacy encodings. The real comparison should be among true Unicode formats.

Atkin and Stansifer make some worthwhile points with regard to the compressibility of different Unicode encoding schemes. Their research shows that the Unicode compression formats may not always provide the expected benefit when text is subsequently compressed using a general-purpose algorithm. However, the compressibility of legacy-encoded text is entirely irrelevant and should be disregarded when evaluating the results of their research.

Compression through normalization

In late 2003, a discussion on the Unicode public mailing list focused on the possibility of converting Unicode text to a different normalization form in order to improve compressibility. This approach may be referred to as compression through normalization.

Normalization forms are described in Unicode Standard Annex #15, which defines two types of equivalence relationships between Unicode text samples encoded with different code points: canonical and compatibility. The relationship we are interested in for compression purposes is canonical equivalence.

To cite one example, the Latin letter A followed by a combining diaeresis (◌̈) is canonically equivalent to the precomposed letter Ä. To cite another, the sequence of Hangul jamos ㅎㅏㄴ is canonically equivalent to the precomposed Hangul syllable 한. The principle behind compression through normalization is that converting text from one form to another—in these examples, to Normalization Form C—may improve the compressibility of the text, especially in the Hangul case (which was the original motivation for the discussion).
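
The Hangul case also shows why normalization helps so much there: the Unicode Standard defines the composition of a jamo sequence into a precomposed syllable by simple arithmetic, so three 3-byte jamos collapse into one 3-byte syllable in UTF-8. The sketch below applies that arithmetic to the L, V, T sequence mentioned above, using the conjoining jamo code points U+1112, U+1161, and U+11AB.

#include <stdio.h>

/* Hangul syllable composition constants from the Unicode Standard, chapter 3. */
#define SBASE  0xAC00
#define LBASE  0x1100
#define VBASE  0x1161
#define TBASE  0x11A7
#define VCOUNT 21
#define TCOUNT 28

/* Compose a leading consonant (L), vowel (V), and optional trailing
   consonant (T, or 0 if none) into a precomposed Hangul syllable. */
static unsigned long compose_hangul(unsigned long L, unsigned long V, unsigned long T)
{
    unsigned long lIndex = L - LBASE;
    unsigned long vIndex = V - VBASE;
    unsigned long tIndex = (T == 0) ? 0 : T - TBASE;
    return SBASE + (lIndex * VCOUNT + vIndex) * TCOUNT + tIndex;
}

int main(void)
{
    /* U+1112 HANGUL CHOSEONG HIEUH, U+1161 HANGUL JUNGSEONG A,
       U+11AB HANGUL JONGSEONG NIEUN -> one precomposed syllable. */
    unsigned long s = compose_hangul(0x1112, 0x1161, 0x11AB);
    printf("three jamos (9 bytes in UTF-8) -> U+%04lX (3 bytes in UTF-8)\n", s);
    return 0;
}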

The question is whether converting text in this way falls legitimately within the scope of text compression. The answer depends, at least in part, on the interpretation of two Unicode conformance requirements:

C9 A process shall not assume that the interpretations of two canonical-equivalent character sequences are distinct.

C10 When a process purports not to modify the interpretation of a valid coded character representation, it shall make no change to that coded character representation other than the possible replacement of character sequences by their canonical-equivalent sequences or the deletion of noncharacter code points.

One view is that compression formats like SCSU and BOCU-1 are not general-purpose schemes, but are specifically designed to work with Unicode text and take advantage of its characteristics. Because of this, they are concerned with the interpretation of Unicode text rather than the exact code points used to represent it. If this is true, then canonical-equivalent sequences may be substituted, in accordance with requirement C10, to improve the compressibility of the text.

However, in many cases, the code points are important as well. For example, a security mechanism that uses checksums to verify that two strings are identical may fail if one of the strings is converted to a different normalization form. [10] This type of mechanism would conform to requirement C9, which is concerned only with the interpretation of character sequences, and does not apply to processes that are sensitive to code-point equivalence.

Additionally, different canonical-equivalent sequences may be displayed differently by certain fonts and rendering engines. Not all “Unicode-compliant” display engines can render all canonical-equivalent sequences correctly, and even those that can do so may display different glyphs for different sequences. Users may prefer the sequence known to generate the “best” glyph(s), making normalization an undesirable event.

Text compression techniques are generally assumed to be “lossless,” meaning that no information will be altered by compressing and decompressing the text. This is not always the case for other types of data; in particular, video and audio formats often incorporate some form of “lossy” compression where the benefit of reduced size outweighs the potential degradation of the original image or sample. Because Unicode provides for multiple equivalent representations of the same text, the line between “lossless” and “lossy” is not as clear as in other character encoding standards. However, a text compression scheme that changes the identity and number of code points in its input stream would likely be viewed as performing an unexpected and unwanted transformation.

Consequently, Unicode text compression schemes should not convert their input to another normalization form unless it is known that the distinction between code-point sequences is not significant to the process that will receive the output. Compression through normalization should especially be avoided if there is any chance that it will cause security checks or other equivalence tests to fail.

Conclusions

There are many ways to compress Unicode text, and it is important to consider the goal that one wishes to accomplish before deciding on a solution. If the goal is to achieve the smallest possible size—for transmission or long-term storage, for example—then general-purpose compression should be used. This would be true regardless of whether the text was originally encoded in ASCII, EBCDIC, Unicode, or any other encoding.

On the other hand, if the goal is simply to reduce the “Unicode penalty” over legacy encodings, then it is worth considering the compression formats, SCSU and BOCU-1. Both of these formats reduce Unicode text to a size comparable to that of legacy encodings, while retaining all the advantages of Unicode. A combination of Unicode compression formats and GP compression schemes may achieve the greatest efficiency, or it may not, depending on the text being compressed and the GP algorithm being employed. It is certainly worth considering whether the added complexity is justified.

English speakers with an occasional need for non-ASCII characters will probably be satisfied with the “compression” offered by UTF-8, but speakers of other languages, and especially users of other scripts, will not realize the same advantage. Under no circumstances should legacy encodings be perceived as a way to compress Unicode text. They should only be used for compatibility with non-Unicode systems or if the text must remain in the legacy encoding for some other reason.

There is a possible problem with encouraging the widespread use of SCSU and BOCU-1. Many casual users are already confused by the large number of encoding schemes associated with Unicode—UTF-32, UTF-16, and UTF-8, in big- or little-endian order, with or without signature. For many, another Unicode file format—for compression or any other purpose—would not be a welcome sight. There is almost certainly no need for a third Unicode compression format, unless some unanticipated, extraordinary situation arises.

Some steps could be taken to make the adoption of SCSU and BOCU-1 more practical. First, rather obviously, Web browsers and other Unicode-aware text tools should be enhanced to support SCSU, and perhaps BOCU-1 as well. HTML pages in particular could benefit from a reduction in size and consequent download time, and the new “XML Suitability” recommendation would all but ensure that SCSU-encoded pages would work well on the Web. Microsoft Internet Explorer 6.0 already supports more than 30 legacy encodings; adding SCSU support to IE would be a relatively simple and forward-thinking improvement. (Remember that decoding SCSU text is actually quite easy; encoding it is what has earned SCSU a reputation for complexity.) This would encourage the use of SCSU for the large number of text processes (such as HTTP) where GP compression is not used, or not visible to the user.

There should also be a movement to encourage the use of signatures in SCSU and BOCU-1 plain-text files and streams. Some operating systems and applications do not work well with the U+FEFF signature in UTF-8 because it interferes with application-specific file signatures or prevents blind concatenation. In particular, XML and HTML files in an ASCII-compatible encoding must not contain a signature (encoding declarations should be used instead). However, ordinary plain-text files are not subject to these constraints. The Notepad utility in Microsoft Windows 2000 and XP uses signatures to identify UTF-8 and UTF-16 text, with great success; using signatures for SCSU and BOCU-1 files could be similarly effective. For BOCU-1 in particular, the use of a file signature is absolutely essential because of the lack of ASCII readability. U+2060 WORD JOINER, which was added in Unicode 3.2, is now “strongly preferred” over U+FEFF as a zero-width no-break space. As this recommendation gains wider acceptance, U+FEFF will become easier to identify as a signature.

Without question, a formal written specification of BOCU-1 needs to be created and added as an update to UTN #6. There is little chance of BOCU-1 being adopted by industry or anyone else as long as its normative reference is a sample C implementation.

Finally, the ASCII readability and resultant usefulness of SCSU could be improved by deciding what to do about the byte 0C. This byte, of course, is the ASCII “form feed,” a widely used character in text documents on the Internet and elsewhere. It is not an SCSU tag, but still must be escaped with a quote tag, just as if it were a tag. In UTS #6, Table 6, “Single Mode Tag Values,” the byte 0C is marked as “reserved for future use”; yet until recently the Summary stated, “The design phase is completed with the publication of this technical report,” implying that no new tags would be created. The Unicode Technical Committee should consider resolving this conflict by either (a) defining a new SCSU single-byte-mode function, for which 0C would serve as the tag, or (b) removing the quoting restriction and allowing 0C to pass as a form feed, without having to be quoted, just as CR and LF currently are. Option (b) seems much more desirable, and could be implemented in a flexible way by allowing 0C to be either quoted (for compatibility with existing encoders) or not (for improved ASCII readability). Whether this could be accomplished without creating two different, “incompatible” variants of SCSU has not yet been determined conclusively.

 

Appendix: a sample SCSU encoder

UTS #6 provides a remarkably thorough and well-written description of SCSU. Despite this, it is difficult to find applications that support SCSU, and even more difficult to find information or advice on writing good SCSU code, on the Internet or elsewhere. To be sure, there are sample implementations to be found—from the Unicode Consortium, the ICU project, and elsewhere—but studying sample code is not always the best way to understand the underlying algorithms. SCSU code in particular has a tendency to be sparsely commented, filled with magic numbers, and obscured by details unrelated to SCSU (such as UTF-16 or UTF-8 bit-shifting logic and stream I/O).

In his book Unicode Demystified, Richard Gillam describes two possible strategies for implementing SCSU. The strategy outlined below is a bit more detailed, based on the real-world encoder used in SC UniPad. This algorithm is intended to be part of a loop, such that the Unicode character to be compressed is already available, and makes the following assumptions:

  1. The encoder starts in the default state (single-byte mode, all dynamic windows in their default positions, dynamic window 0 active).
  2. The state of the encoder (current mode, positions of dynamic windows, active window) is maintained from one iteration to the next.
  3. For a given character C:

    1. the byte BS is defined as the offset of C within the static window being quoted, and
    2. the byte BD is defined as the offset of C within the dynamic window that is currently active (or being quoted), plus 80 (hex).
    For example, if dynamic window 3 is active and set to the Arabic block starting at U+0600, then the character U+0627 would be encoded with the BD byte A7 (0627 – 0600 + 80). The names BS and BD are not part of the formal definition of SCSU, but help simplify this description.
  4. A compressible character is defined as one with a Unicode scalar value less than U+3400, or greater than or equal to U+E000. These are the characters for which it is possible to define a dynamic window.
  5. Whenever the encoder outputs a tag—to change modes or windows, or to reposition a dynamic window—it performs the same operation internally as well. That is, when issuing an SCU tag, the encoder itself must change from single-byte mode to Unicode mode.
  6. The least recently used dynamic window is always the first to be reassigned. Every time a character is used from a given window, the “age” of that window is updated so the encoder can tell which was least recently used.

In single-byte mode:

In Unicode mode:

There are some opportunities for improvement in this algorithm. For example, in single-byte mode, the encoder always checks static windows before defining a new dynamic window. This is usually a good idea, but if the next two non-ASCII characters can also fit into the same window, it would be more efficient to define a dynamic window and switch to it once, rather than using three quote tags.

Similarly, when changing from Unicode mode to single-byte mode, the encoder must select an active window. Instead of automatically selecting the window used last time, it would be better to look ahead to determine which window will be the next one needed.

Still, this is a fairly effective algorithm—good enough to compress the 116-character Japanese sample text in UTS #6 to only 177 bytes, compared with 178 in the example. This is particularly respectable considering the limitations of one-character look-ahead. It is also possible to add special compression logic to improve the encoder’s performance on certain classes of text, such as those that use multiple windows (for example, Vietnamese or polytonic Greek). These techniques add complexity to the encoder, and it is up to each developer to decide whether the improved performance is worth the effort.
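
As a small concrete companion to the description above, here is a sketch of two of the helper computations it relies on: the compressibility test from assumption 4 and the BD byte from assumption 3, checked against the Arabic example given there. These are only the arithmetic pieces; window selection, look-ahead, and tag emission are where the real encoder effort goes.

#include <stdio.h>

/* Assumption 4: a character is "compressible" (a dynamic window can be
   defined for it) if its scalar value is below U+3400 or at or above U+E000. */
static int is_compressible(unsigned long c)
{
    return c < 0x3400 || c >= 0xE000;
}

/* Assumption 3b: the BD byte is the character's offset within the active
   (or quoted) dynamic window, plus 0x80. */
static unsigned char bd_byte(unsigned long c, unsigned long window_start)
{
    return (unsigned char)((c - window_start) + 0x80);
}

int main(void)
{
    /* Example from assumption 3: a dynamic window set to the Arabic block
       starting at U+0600; U+0627 should be encoded as the BD byte A7. */
    printf("U+0627 compressible: %s\n", is_compressible(0x0627) ? "yes" : "no");
    printf("BD byte: %02X\n", bd_byte(0x0627, 0x0600));
    return 0;
}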

Notes

[1] The Chinese standard GB 18030-2000 is essentially a retrofit of ISO/IEC 10646 onto the existing GB 2312 framework. The repertoire of GB 18030 is determined by ISO 10646.

[2] In this paper, numbers in this font are assumed to be hexadecimal.

[3] In ISO/IEC 2022 terminology, this range is known as the primary or C0 (“C-zero”) set of control characters.

[4] A previous attempt, UTF-1, preserved the control bytes but not the printable ASCII bytes, and therefore did not solve the problem of slashes in Unix file names.

[5] Even then, it’s not that simple: characters that conflict with Unicode-mode tag bytes must still be quoted.

[6] Even in “Unicode mode,” SCSU tags are interpreted as single bytes.

[7] Technical Notes are not officially endorsed by the Unicode Consortium, unlike (for example) the Unicode Technical Standard that defines SCSU. They are presented for information only.

[8] Note that RLE or other techniques could do better than this, especially if the data can be correlated in some way. For example, a message consisting of 5,000 A’s followed by 5,000 B’s can be compressed by PKZIP into only 224 bits, plus overhead.

[9] Almost all legacy character sets include ASCII as a subset, so they can encode (say) English and Russian text in the same document, or English and Arabic. They cannot encode Russian and Arabic, however.

[10] Of course, it would also fail if the string were converted to a different encoding scheme, such as from UTF-16 to UTF-8.

Acknowledgements

Thanks to Steven Atkin, John Cowan, Mark Davis, Asmus Freytag, James Kass, Rick McGowan, and Markus Scherer for reviewing this paper and offering many useful suggestions, and to Torsten Mohrin for his original suggestion to “tell something about your SCSU encoder.”

Version history

Date                Description
January 1, 2004     Initial release.
January 30, 2004    Corrected an error in the description of BOCU-1.

References

Atkin, Steven and Stansifer, Ryan. Unicode Compression: Does Size Really Matter? Updated November 2002 from the presentation at the 22nd International Unicode Conference (San José, Calif., September 2002). Available online at http://www.cs.fit.edu/~satkin/docs/rep.pdf. Original presentation available at http://www.cs.fit.edu/~ryan/compress/compress.pdf. Sample data available at http://www.cs.fit.edu/~ryan/compress/.

Becker, Joseph. E-mail message on public Unicode mailing list (July 2000), containing the original UTF-8 proposal by Ken Thompson. Available online at http://groups.yahoo.com/group/unicode/message/3235.

Brettin, Mark. Huffman Encoding. Available online at http://www.la.unm.edu/~mbrettin/algorithms/huffman.html.

Burrows, M. and Wheeler, D.J. A Block-sorting Lossless Data Compression Algorithm. SRC Research Report 124. Palo Alto, Calif.: Digital Systems Research Center (May 1994). Available online at ftp://ftp.digital.com/pub/DEC/SRC/research-reports/SRC-124.ps.gz.

Czyborra, Roman. Sample SCSU decoder available online at http://czyborra.com/scsu/scsu.c. See also http://czyborra.com/utf/.

Davis, Mark et al. Unicode 3.1. Unicode Standard Annex #27, tracking number 4 (May 2001). Available online at http://www.unicode.org/reports/tr27/.

Davis, Mark and Dürst, Martin. Unicode Normalization Forms. Unicode Standard Annex #15, tracking number 23 (April 2003). Available online at http://www.unicode.org/reports/tr15/.

ECMA. Character Code Structure and Extension Techniques. Standard ECMA-35 (6th Edition, December 1994). Available online at http://www.ecma-international.org/publications/files/ECMA-ST/Ecma-035.pdf.

Freed, Ned and Borenstein, Nathaniel. Multipurpose Internet Mail Extensions (MIME) Part One: Format of Internet Message Bodies. RFC 2045 (November 1996). Available online at http://www.ietf.org/rfc/rfc2045.txt.

Freed, Ned and Borenstein, Nathaniel. Multipurpose Internet Mail Extensions (MIME) Part Two: Media Types. RFC 2046 (November 1996). Available online at http://www.ietf.org/rfc/rfc2046.txt.

Gillam, Richard T. Unicode Demystified: A Practical Programmer’s Guide to the Encoding Standard. Reading, Mass.: Addison-Wesley, 2002 (ISBN 0-201-70052-2).

International Components for Unicode. Home page at http://www-124.ibm.com/icu/.

ISO/IEC JTC 1/SC2/WG2. UCS Transformation Format One (UTF-1), amended June 1995. Available online in the ISO International Register of Coded Character Sets, at http://www.itscj.ipsj.or.jp/ISO-IR/178.pdf.

McIlroy, Douglas. Data compression. Unit in Computer Science 15, Data Structures and Programming, Fall 2002, Dartmouth College. Formerly available online at http://www.cs.dartmouth.edu/~doug/cs15/compression.html.

Nelson, Mark. Arithmetic Coding + Statistical Modeling = Data Compression. From Dr. Dobb’s Journal (February 1991). Available online at http://dogma.net/markn/articles/arith/part1.htm.

Scherer, Markus W. Compact Encodings of Unicode. Presentation at the 22nd International Unicode Conference (San José, Calif., September 2002). Available online at http://oss.software.ibm.com/icu/docs/papers/compact_encodings_iuc22.ppt.

Scherer, Markus W. and Davis, Mark. BOCU-1: MIME-Compatible Unicode Compression. Unicode Technical Note #6, version 1 (August 2002). Available online at http://www.unicode.org/notes/tn6/.

Seward, Julian. bzip2 and libbzip2: a program and library for data compression (December 2001). Available online at ftp://sources.redhat.com/pub/bzip2/docs/manual.pdf. Official home page for bzip2 at http://sources.redhat.com/bzip2/.

Sybase Inc. Unilib Reference Manual, Chapter 4, “Unicode Compression.” Available online at http://manuals.sybase.com/onlinebooks/group-ucarc/ucg0200e/ulrefman/@Generic__BookTextView/7765#X.

The Unicode Consortium. Archives of the Unicode Mail List. Available online at http://www.unicode.org/mail-arch/ (search for keywords “compression” and “normalization”).

The Unicode Consortium. Frequently Asked Questions: Compression (updated September 2003; answers contributed by Asmus Freytag). Available online at http://www.unicode.org/faq/compression.html.

The Unicode Consortium. The Unicode Standard, Version 4.0.0, defined by The Unicode Standard, Version 4.0. Boston, Mass.: Addison-Wesley, 2003 (ISBN 0-321-18578-1).

Wolf, Misha et al. A Standard Compression Scheme for Unicode. Unicode Technical Standard #6, version 3.5 (July 2002). Available online at http://www.unicode.org/reports/tr6/. Sample implementation available at ftp://ftp.unicode.org/Public/PROGRAMS/SCSU/.

The World Wide Web Consortium (W3C). Extensible Markup Language (XML) 1.0 (Second Edition). W3C Recommendation (October 2000). Available online at http://www.w3.org/TR/REC-xml.

The World Wide Web Consortium (W3C). HTML 4.01 Specification. W3C Recommendation (December 1999). Available online at http://www.w3.org/TR/html4/.

