mick mccarthy / software engineer


A Quick Overview of Character Encoding Schemes

[Wed, 01 Mar 2006]

Introduction

Most writing systems are based on the representation of consonants, vowels, syllables or meanings using graphical characters. Alphabets such as the Latin or Greek alphabets represent vowels and consonants by characters. Abjads such as Arabic or Hebrew normally represent only consonants by characters. Syllabic alphabets such as Bengali represent each syllable (a consonant and vowel combination) by a character. More complex scripts such as Japanese Kanji use characters that represent both a syllable and meaning.

Alphabets such as the one used by the English language have a small number of characters, typically between 20 and 50. However, in the English-centric world of computing, it is easy to forget that complex scripts such as Chinese can have up to 10,000 different characters. Technology must therefore support character sets far larger than the small set the English alphabet requires.

A character set encoding allows the representation of a graphical character in a numerical format so that it may be manipulated and stored using software or hardware. We need to map each character to a numerical code that can then be converted to binary and represented as one or several bytes.

A single byte, with its 8 bits, can represent 256 values ranging from zero (all bits set to zero) to 255 (all bits set to 1). Using a single byte to represent each character in a complex script is therefore not feasible; several bytes are required. Various character encoding schemes have been developed that map a large range of characters to multiple byte sequences, and some of these are now examined.
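As a rough illustration of the mapping described above, the following Python sketch treats the value returned by the built-in `ord()` function (the character's Unicode code point) as the numerical code. The helper name `bytes_needed` is hypothetical; it only counts how many whole bytes are needed to hold that number, whereas real encoding schemes add their own structure on top.

```python
def bytes_needed(code: int) -> int:
    """Smallest number of whole bytes that can hold a non-negative code."""
    return max(1, (code.bit_length() + 7) // 8)


# A single byte has 8 bits, so it can hold 2 ** 8 distinct values.
print("values per byte:", 2 ** 8)

for ch in ("A", "\u06f8"):
    code = ord(ch)  # numerical code assigned to the character
    print(ch, "code:", code, "binary:", format(code, "b"),
          "bytes needed:", bytes_needed(code))
```

The Latin A fits in one byte, while the second character already needs two.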

Comparison

Rather than examining the representation of every character in every encoding scheme, two representative characters are chosen and compared across the schemes. The characters used are the Latin alphabet's A and the Arabic-script character ۸ (the Extended Arabic-Indic digit eight), which is the number 8. Since it all comes down to bits and bytes in the end, the byte values used to represent each character are shown in hexadecimal format.

ASCII

ASCII was first published as a standard by the American Standards Association in 1963. The latest version was published by ANSI (American National Standards Institute) in 1986 as ANSI specification X3.4-1986. ASCII characters are represented by 7 bits, although extended versions are available that use the 8th bit to represent a larger range of characters. ASCII can therefore support a maximum of 128 characters, which falls far short of the requirement to support large character sets. The ASCII specification is not freely available and must be purchased from ANSI.

Key features:

  • Characters represented by 7-bit code
  • Includes 95 printable characters
  • Includes non-printable character codes

Table 1: ASCII Example

Character   Description     Representation in Bytes
A           Latin A         0x41
۸           not supported   not supported
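
The two rows of Table 1 can be reproduced with Python's built-in `ascii` codec. This is only a sketch; the helper name `ascii_hex` is made up for the illustration.

```python
def ascii_hex(text: str) -> str:
    """Encode text as ASCII and return the bytes in 0xNN notation."""
    return " ".join(f"0x{b:02X}" for b in text.encode("ascii"))


print("A ->", ascii_hex("A"))

try:
    ascii_hex("\u06f8")  # the character ۸ lies outside the 7-bit range
except UnicodeEncodeError:
    print("\u06f8 -> not supported")
```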

ISO/IEC 8859

ISO and the IEC developed this character encoding specification, which uses 8 bits per character to represent a larger range of characters than ASCII. The specification is divided into 15 different parts, each of which deals with a specific group of characters. ISO/IEC 8859-1, also known as Latin1, represents the most common Latin alphabet characters. Final versions of the specification must be purchased from the ISO; however, draft versions are available from the W3C and other sites online.

Features:

  • Each character represented by 8 bits i.e. 1 byte
  • 15 parts to the specification
  • Each part defines at most 256 code values
  • Some characters are present in more than one part of the specification

Table 2: ISO/IEC 8859 Example

Character   Description           Representation in Bytes
A           ISO 8859-1 (Latin1)   0x41
A           ISO 8859-2 (Latin2)   0x41
A           ISO 8859-3 (Latin3)   0x41
A           ISO 8859-6 (Arabic)   0x41
۸           ISO 8859-6 (Arabic)   not supported

Table 2 demonstrates that multiple parts of the specification use the same codes for a core set of characters. The table also demonstrates that part 6 does not support all Arabic characters, as there is no representation for the character ۸. As all characters must be stored in 1 byte and each part of the specification contains a core set of characters, the actual range of complex characters supported is small. Also, as each part uses the same character codes for different characters, the part of the specification that a text uses must be known in order to interpret its characters correctly.
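
The behaviour described above can be checked with the ISO 8859 codecs that ship with Python (`iso8859_1`, `iso8859_2`, `iso8859_3`, `iso8859_6`). The helper name `encode_in_parts` and the sample byte 0xC7 are choices made for this sketch, not part of the specification.

```python
PARTS = ["iso8859_1", "iso8859_2", "iso8859_3", "iso8859_6"]


def encode_in_parts(ch: str) -> dict:
    """Map each ISO 8859 part to the hex byte for ch, or None if unsupported."""
    result = {}
    for part in PARTS:
        try:
            result[part] = ch.encode(part).hex()
        except UnicodeEncodeError:
            result[part] = None
    return result


print(encode_in_parts("A"))        # the core character A is in every part
print(encode_in_parts("\u06f8"))   # the character ۸ is in none of them

# The same byte value means different characters in different parts,
# so the part in use must be known before the byte can be interpreted.
same_byte = bytes([0xC7])
print(same_byte.decode("iso8859_1"), same_byte.decode("iso8859_6"))
```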

Unicode

The Unicode Consortium, a U.S.-based industry group, published the first version of this standard in 1991 and continues to develop the standard today. It is closely related to the ISO/IEC 10646 specification, as the two organisations aim to keep the standards synchronised. Each character is represented by a unique code point, and when writing a code point it is normal to precede the code point number with U+, e.g. U+0041.

The Universal Character Set (UCS) is intended to contain all possible characters, and this large set is divided into planes, each of which contains 65,536 code points (the maximum number that can be represented by 2 bytes). The Basic Multilingual Plane (BMP) is the first plane and it contains most of the commonly used characters.

Two types of encoding are available for Unicode characters. They can be encoded using a Unicode Transformation Format (UTF) or using a UCS format, defined by ISO/IEC 10646.

Table 3: Unicode Code Points

Character   Code Point
A           U+0041
۸           U+06F8
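
The U+ notation and the division into planes can be sketched in a few lines of Python. The helper names `code_point_label` and `plane` are hypothetical, and the third sample character (U+1F600) is added here only to show a code point outside the BMP.

```python
def code_point_label(ch: str) -> str:
    """Format a character's code point in the usual U+XXXX notation."""
    return f"U+{ord(ch):04X}"


def plane(ch: str) -> int:
    """Each plane holds 65,536 code points, so the plane is the code point // 65536."""
    return ord(ch) >> 16


for ch in ("A", "\u06f8", "\U0001F600"):
    print(code_point_label(ch), "plane", plane(ch))
```

Plane 0 is the BMP, so both characters from Table 3 fall inside it.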

UTF-8

This is a variable length encoding scheme that uses one to four bytes to encode each code point. RFC 3629 defines the UTF-8 encoding and it is also described in the Unicode specification. Code points from U+0000 to U+007F map directly to the ASCII character set, so ASCII-encoded text is also valid UTF-8. Table 4 shows the number of bytes needed to encode a character depending on its code point.

Features:

  • Variable length encoding
  • Code points encoded using from 1 to 4 bytes

Table 4: UTF-8 Byte Requirements

Code Point Range      Bytes Used
U+0000 – U+007F       1
U+0080 – U+07FF       2
U+0800 – U+FFFF       3
U+10000 – U+10FFFF    4

Table 5: UTF-8 Example

Character   Description         Representation in Bytes
A           Code point U+0041   0x41
۸           Code point U+06F8   0xDB 0xB8
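
To show how the byte counts and the example bytes arise, here is a minimal hand-written encoder that follows the bit layout given in RFC 3629. It is a sketch for illustration only (the function name `utf8_encode` is hypothetical); in practice Python's built-in `str.encode("utf-8")` does the same job.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 (bit layout per RFC 3629)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable code point")
    if cp < 0x80:        # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp < 0x800:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


for cp in (0x0041, 0x06F8):
    encoded = utf8_encode(cp)
    print(f"U+{cp:04X}", " ".join(f"0x{b:02X}" for b in encoded))
```

The leading bits of the first byte tell a decoder how many bytes belong to the character, which is why no separate length field is needed.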

UTF-16

This is a variable length encoding scheme that encodes code points as 16-bit code values. If a character is not in the BMP, i.e. its code point is greater than U+FFFF, then a pair of code values (a surrogate pair) is used to represent the character. RFC 2781 defines the UTF-16 encoding and it is also described in the Unicode specification. Windows NT, Java and Qt use UTF-16 to internally represent characters.

As each code value requires 2 bytes, the endianness of the bytes must be communicated so that the code values are interpreted correctly. For this reason, the Byte Order Mark (BOM) is used: the code point U+FEFF is placed at the start of the UTF-16 data. A leading byte sequence of 0xFE 0xFF therefore indicates that the data is in big-endian order, while 0xFF 0xFE indicates little-endian order.

Features:

  • Variable length encoding scheme

  • A code point is represented by one or two two-byte code values

Table 6: UTF-16 Example

Character   Description         Representation in Bytes (big-endian)
A           Code point U+0041   0x00 0x41
۸           Code point U+06F8   0x06 0xF8
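
The following sketch shows the two ideas from this section: splitting a code point into 16-bit code values, and the effect of byte order. The helper names `utf16_code_values` and `hex_bytes` are hypothetical, and the character U+1F600 is added only as an example of a code point outside the BMP; the codec names and `codecs.BOM_UTF16_BE` / `codecs.BOM_UTF16_LE` come from Python's standard library.

```python
import codecs


def utf16_code_values(ch: str) -> list:
    """Return the 16-bit code values UTF-16 uses for one character."""
    cp = ord(ch)
    if cp < 0x10000:               # inside the BMP: one code value
        return [cp]
    cp -= 0x10000                  # outside the BMP: build a surrogate pair
    return [0xD800 | (cp >> 10), 0xDC00 | (cp & 0x3FF)]


def hex_bytes(data: bytes) -> str:
    return " ".join(f"0x{b:02X}" for b in data)


for ch in ("A", "\u06f8", "\U0001F600"):
    print([f"0x{v:04X}" for v in utf16_code_values(ch)],
          "BE:", hex_bytes(ch.encode("utf-16-be")),
          "LE:", hex_bytes(ch.encode("utf-16-le")))

# The BOM is U+FEFF written in the byte order of the data that follows.
print("BOM big-endian:   ", hex_bytes(codecs.BOM_UTF16_BE))
print("BOM little-endian:", hex_bytes(codecs.BOM_UTF16_LE))
```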

Base64

Base64 is not a character encoding scheme but is nonetheless interesting. Base64 encoding is used to transform a sequence of byte values (they may be image information, program data or anything else) to a textual representation. In Base64 the byte values are transformed to a subset of the ASCII printable characters. The character sequence may then be transmitted by a protocol that only accepts ASCII encoding, and later it can be transformed back to its original format. RFC 2045 describes Base64 encoding as used in the MIME email format.
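
The round trip described above can be sketched with Python's standard `base64` module. The wrapper names `to_base64` and `from_base64` and the sample bytes are chosen for this illustration only.

```python
import base64


def to_base64(data: bytes) -> str:
    """Turn arbitrary bytes into printable ASCII text."""
    return base64.b64encode(data).decode("ascii")


def from_base64(text: str) -> bytes:
    """Recover the original bytes from their Base64 text form."""
    return base64.b64decode(text)


raw = bytes([0xDB, 0xB8, 0x00, 0xFF])   # arbitrary binary data
text = to_base64(raw)
print(text)
print(from_base64(text) == raw)
```

Every three input bytes become four output characters, so the textual form is roughly a third larger than the original data.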

Conclusion

Several of the most common character encoding schemes have been briefly described. Each scheme has its own advantages and disadvantages, but all have a similar aim, which they achieve with varying degrees of success. Unicode is the most comprehensive specification and it continues to be developed, adding support for new character sets. Many other character encodings exist; however, Unicode and ISO/IEC 10646 are the prevalent standards today.

References

http://www.omniglot.com/index.htm provides an informative overview of alphabets and complex writing systems.

http://www.ietf.org/rfc/rfc1345.txt  RFC 1345 provides standardised names for characters that should be used in other Internet protocol definitions.

http://www.ietf.org/rfc/rfc3629.txt  RFC 3629 describes the UTF-8 encoding scheme.

http://www.ietf.org/rfc/rfc2781.txt  RFC 2781 describes the UTF-16 encoding scheme.

http://www.ietf.org/rfc/rfc2045.txt  RFC 2045 describes the MIME email format and Base64 encoding.

http://en.wikipedia.org/wiki/Character_encoding  Wikipedia provides an extensive discussion of many character encoding schemes and standards.

Author Information

Michael McCarthy is a professional Software Engineer who specialises in Java and open source software development. For more information on his activities, visit http://www.mickmccarthy.com.