ASCII and UTF-8


ASCII is the abbreviation for American Standard Code for Information Interchange and is a standard for encoding text. It defines 95 printable characters and 33 control characters.

Printable characters:

+-------------------------+-------------------------+-------------------------+
|                         |                         |                         |
|     0x20 /  32  space   |     0x40 /  64    @     |     0x60 /  96    `     |
|     0x21 /  33    !     |     0x41 /  65    A     |     0x61 /  97    a     |
|     0x22 /  34    "     |     0x42 /  66    B     |     0x62 /  98    b     |
|     0x23 /  35    #     |     0x43 /  67    C     |     0x63 /  99    c     |
|     0x24 /  36    $     |     0x44 /  68    D     |     0x64 / 100    d     |
|     0x25 /  37    %     |     0x45 /  69    E     |     0x65 / 101    e     |
|     0x26 /  38    &     |     0x46 /  70    F     |     0x66 / 102    f     |
|     0x27 /  39    '     |     0x47 /  71    G     |     0x67 / 103    g     |
|     0x28 /  40    (     |     0x48 /  72    H     |     0x68 / 104    h     |
|     0x29 /  41    )     |     0x49 /  73    I     |     0x69 / 105    i     |
|     0x2a /  42    *     |     0x4a /  74    J     |     0x6a / 106    j     |
|     0x2b /  43    +     |     0x4b /  75    K     |     0x6b / 107    k     |
|     0x2c /  44    ,     |     0x4c /  76    L     |     0x6c / 108    l     |
|     0x2d /  45    -     |     0x4d /  77    M     |     0x6d / 109    m     |
|     0x2e /  46    .     |     0x4e /  78    N     |     0x6e / 110    n     |
|     0x2f /  47    /     |     0x4f /  79    O     |     0x6f / 111    o     |
|     0x30 /  48    0     |     0x50 /  80    P     |     0x70 / 112    p     |
|     0x31 /  49    1     |     0x51 /  81    Q     |     0x71 / 113    q     |
|     0x32 /  50    2     |     0x52 /  82    R     |     0x72 / 114    r     |
|     0x33 /  51    3     |     0x53 /  83    S     |     0x73 / 115    s     |
|     0x34 /  52    4     |     0x54 /  84    T     |     0x74 / 116    t     |
|     0x35 /  53    5     |     0x55 /  85    U     |     0x75 / 117    u     |
|     0x36 /  54    6     |     0x56 /  86    V     |     0x76 / 118    v     |
|     0x37 /  55    7     |     0x57 /  87    W     |     0x77 / 119    w     |
|     0x38 /  56    8     |     0x58 /  88    X     |     0x78 / 120    x     |
|     0x39 /  57    9     |     0x59 /  89    Y     |     0x79 / 121    y     |
|     0x3a /  58    :     |     0x5a /  90    Z     |     0x7a / 122    z     |
|     0x3b /  59    ;     |     0x5b /  91    [     |     0x7b / 123    {     |
|     0x3c /  60    <     |     0x5c /  92    \     |     0x7c / 124    |     |
|     0x3d /  61    =     |     0x5d /  93    ]     |     0x7d / 125    }     |
|     0x3e /  62    >     |     0x5e /  94    ^     |     0x7e / 126    ~     |
|     0x3f /  63    ?     |     0x5f /  95    _     |                         |
|                         |                         |                         |
+-------------------------+-------------------------+-------------------------+

Control characters:

+------+------+---------------------------+
| 0x00 | \0   | null character            |
| 0x01 |      | start of header           |
| 0x02 |      | start of text             |
| 0x03 |      | end of text               |
| 0x04 |      | end of transmission       |
| 0x05 |      | enquiry                   |
| 0x06 |      | acknowledge               |
| 0x07 | \a   | bell                      |
| 0x08 | \b   | backspace                 |
| 0x09 | \t   | horizontal tabulation     |
| 0x0a | \n   | line feed       (LF)      |
| 0x0b | \v   | vertical tabulation       |
| 0x0c | \f   | form feed                 |
| 0x0d | \r   | carriage return (CR)      |
| 0x0e |      | shift out                 |
| 0x0f |      | shift in                  |
| 0x10 |      | data link escape          |
| 0x11 |      | device control #1         |
| 0x12 |      | device control #2         |
| 0x13 |      | device control #3         |
| 0x14 |      | device control #4         |
| 0x15 |      | negative acknowledgement  |
| 0x16 |      | synchronous idle          |
| 0x17 |      | end of transmission block |
| 0x18 |      | cancel                    |
| 0x19 |      | end of medium             |
| 0x1a |      | substitute                |
| 0x1b | (\e) | escape                    |
| 0x1c |      | file separator            |
| 0x1d |      | group separator           |
| 0x1e |      | record separator          |
| 0x1f |      | unit separator            |
|      |      |                           |
| 0x7f |      | delete                    |
+------+------+---------------------------+

Details

The use of all ASCII characters in the C and C++ languages is detailed on the Characters page.

ASCII is closely tied to the ISO 646 character encoding, whose common core is also known as the invariant code set. ASCII corresponds to the US variant of ISO 646 and is therefore a superset of the invariant code set. The languages C and C++ remain compatible with ISO 646 in that programs can be written using digraphs, trigraphs and the alternative operator spellings (in C provided by the standard library header iso646.h). Note that trigraphs have been removed from the newest language standards (C++17 and C23). Nowadays, however, the ASCII standard can generally be assumed.
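As a small illustration of the ISO 646 compatibility features mentioned above, the following sketch uses the alternative spelling `and` from iso646.h as well as digraphs. The function names in_range and third_element are made up for this example.

```c
#include <iso646.h>   /* in C, defines and, or, not, ... as macros */
#include <stdbool.h>

/* Same meaning as: value >= low && value <= high */
bool in_range(int value, int low, int high)
{
    return value >= low and value <= high;
}

/* Digraphs: <: and :> stand for [ and ], <% and %> stand for { and } */
int third_element(void)
<%
    int values<:3:> = <% 10, 20, 30 %>;
    return values<:2:>;
%>
```

Both functions compile to exactly the same code as their conventional spellings; the alternative tokens only matter on systems whose character set lacks some of the punctuation characters.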

While ASCII is one of the oldest standards in the computer industry, it is built into nearly every system around the world, from supercomputers to washing machines. ASCII has stood the test of time and is considered the universal raw format for text files. For this reason, program code, log files and scientific raw data, for example, are still written as ASCII files today. ASCII files contain no formatting, images, graphics, tables or fonts. Many programs offer interfaces for importing and exporting ASCII files.

ASCII is a 7-bit encoding in which each of the 128 possible values is assigned a character. A byte nowadays usually consists of 8 bits, so the 7 bits of an ASCII character are stored in 8 bits. The surplus (most significant) bit is not defined by ASCII and has been used in different ways. This results in different character assignments for the values 128 - 255, each of which is referred to as an encoding. Today, UTF-8 has become the standard encoding. Other encodings, such as the former standard encodings of Windows and macOS, are found more and more rarely; they are described in the sections further down.
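Because ASCII leaves the most significant bit unused, a buffer can be tested for pure ASCII content by checking that bit in every byte. The following is a minimal sketch; the function name is_pure_ascii is chosen for this example only.

```c
#include <stdbool.h>
#include <stddef.h>

/* Returns true if every byte in buf[0..len) has its most significant
   bit cleared, i.e. the buffer holds only 7-bit ASCII values (0 - 127). */
bool is_pure_ascii(const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (buf[i] & 0x80) {
            return false;   /* bit 7 set: not an ASCII byte */
        }
    }
    return true;
}
```

A file that passes this test reads the same in ASCII, UTF-8, Latin-1 and Windows-1252, since these encodings only differ in the values 128 - 255.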

Originally, the characters 0 - 31 and the character 127 had the task of controlling the input and output of text and are therefore called control characters. Most control characters are no longer used today, but some of them are still very important when working with strings or in programming in general. The most important control characters and some other characters can be written in C and C++ using escape sequences.

Characters 32 through 126 have a visual representation and are therefore called printable characters. The printable characters include Arabic numerals, Latin upper and lower case letters as well as a selection of punctuation marks and other characters that are used in particular when programming. All printable characters can safely occur in text files. In contrast to the control characters, they have no special meaning and simply represent the character listed in the table above. Character 32 is the space character.
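The two ranges described above can be expressed directly as comparisons. The sketch below uses hypothetical function names; the standard functions iscntrl() and isprint() from ctype.h give the same answers in the default "C" locale but may behave differently in other locales.

```c
#include <stdbool.h>

/* Control characters: 0 - 31 and 127. */
bool ascii_is_control(unsigned char c)
{
    return c < 0x20 || c == 0x7f;
}

/* Printable characters: 32 (space) - 126 (~). */
bool ascii_is_printable(unsigned char c)
{
    return c >= 0x20 && c <= 0x7e;
}
```

Values of 128 and above are neither control nor printable here, because they are not ASCII characters at all.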

Newlines

A line break is coded differently depending on the system:

Unix:     LF     \n
Windows:  CR+LF  \r\n

There are other systems which use yet other combinations. For everyday programming, however, only the two line-ending variants Unix and Windows are generally important today.

It should be noted that each line-ending variant can occur independently of the encoding, and variants sometimes even appear mixed within a single file. Unfortunately, even with modern format standards, no single variant has established itself, which continues to make life difficult for programmers. Problems can arise, for example, when a file is opened in C using fopen() in text mode. Line endings may then be converted automatically, which can lead to different results on different systems, for example in index calculations. Furthermore, depending on the situation, system, application or protocol, the combination CR+LF may be counted either as two individual characters or as a single character.
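One way to avoid system-dependent conversion is to open the file in binary mode (fopen() with mode "rb") and normalize the line endings explicitly. The following sketch converts CR+LF to LF in a buffer that has already been read; the function name crlf_to_lf is an assumption made for this example.

```c
#include <stddef.h>

/* Rewrites buf in place so that every CR+LF pair becomes a single LF.
   A CR that is not followed by LF is kept unchanged.
   Returns the new length, which is at most len. */
size_t crlf_to_lf(char *buf, size_t len)
{
    size_t out = 0;
    for (size_t in = 0; in < len; in++) {
        if (buf[in] == '\r' && in + 1 < len && buf[in + 1] == '\n') {
            continue;   /* drop the CR; the LF is copied in the next round */
        }
        buf[out++] = buf[in];
    }
    return out;
}
```

After this step, byte offsets and line counts are the same regardless of the system the file came from. Files with mixed line endings are handled as well, since each CR+LF pair is treated individually.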

UTF-8 Encoding

Due to the 7-bit coding of ASCII, only a few characters can be represented. The extension to 8 bits mitigated this problem only slightly and at the same time created new compatibility problems between different systems, which still occur today.

With the introduction of Unicode, a standard with room for over a million characters was created. This range is sufficient for a wide variety of characters such as Greek letters, Asian characters, pictograms, mathematical symbols, musical notes, and more. Because of its broad coverage, the Unicode character set is also known as the Universal Character Set (UCS).

Unicode is now widespread and the standard on many systems. One problem that Unicode itself did not solve was compatibility with the ASCII standard, which is accepted worldwide. Unicode code points currently require up to 21 bits, which does not fit the 7 bits of ASCII. To address this problem, the UTF-8 encoding was designed, which makes ASCII upwardly compatible with Unicode.

UTF-8 is the abbreviation for 8-bit UCS transformation format. It can represent all Unicode characters and stores the 7-bit ASCII characters without conversion. On existing systems, the 7 bits of an ASCII character were always stored as 8 bits, with the leading bit set to 0. UTF-8 adopted this property and additionally defined that a byte whose leading bit is 1 is not an ASCII character but part of a multibyte Unicode character.

The non-ASCII characters are implemented in UTF-8 as sequences of several bytes, with the first byte of a character being the start byte. The leading bits of the start byte determine how many follow-up bytes belong to the character. Such a coding is also called multibyte coding. The conversion scheme is shown in the table below:

0xxxxxxx                             ASCII characters
110xxxxx 10xxxxxx                    Unicode U+000080 to U+0007ff
1110xxxx 10xxxxxx 10xxxxxx           Unicode U+000800 to U+00ffff
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx  Unicode U+010000 to U+10ffff

The x denote arbitrary bits which are concatenated during the conversion to Unicode, resulting in the characters listed in the right-hand column. It should be noted that characters with small values could theoretically be encoded in several ways, using more bytes than necessary. However, the UTF-8 standard dictates that only the shortest possible encoding is allowed.
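The conversion scheme can be sketched as a small encoder that always picks the shortest form. The function name utf8_encode is chosen for this example. Two details go beyond the text above and are stated here as assumptions of the sketch: Unicode ends at U+10FFFF, and the surrogate range U+D800 to U+DFFF must not be encoded in UTF-8.

```c
#include <stddef.h>
#include <stdint.h>

/* Encodes the code point cp into out (room for 4 bytes required).
   Returns the number of bytes written, or 0 if cp cannot be encoded
   (above U+10FFFF or inside the surrogate range U+D800 - U+DFFF). */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                       /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            return 0;                      /* surrogates are not characters */
        }
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

For example, U+00E4 (ä) is encoded as the two bytes 0xC3 0xA4 and U+20AC (€) as the three bytes 0xE2 0x82 0xAC. Because the ranges are tested in ascending order, the shortest encoding is chosen automatically.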

The special coding with the leading bits also allows a byte to be identified directly as a start byte or as a follow-up byte. This is particularly important when recovering from damaged data or when counting the number of characters in a string.
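Both uses of the leading bits can be sketched in a few lines. The function names below are hypothetical, and the counting function assumes well-formed input. Note that it counts Unicode code points, which is not always the same as what a user perceives as one character.

```c
#include <stddef.h>

/* Returns the total number of bytes in the sequence introduced by the
   start byte b (1 - 4), or 0 if b is a follow-up byte (10xxxxxx) or
   matches none of the start byte patterns. Only the leading bits are
   inspected; start bytes that could only begin an overlong or
   out-of-range sequence are not rejected here. */
int utf8_sequence_length(unsigned char b)
{
    if ((b & 0x80) == 0x00) return 1;   /* 0xxxxxxx */
    if ((b & 0xE0) == 0xC0) return 2;   /* 110xxxxx */
    if ((b & 0xF0) == 0xE0) return 3;   /* 1110xxxx */
    if ((b & 0xF8) == 0xF0) return 4;   /* 11110xxx */
    return 0;
}

/* Counts the code points in a NUL-terminated UTF-8 string by counting
   every byte that is not a follow-up byte. */
size_t utf8_count(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80) {
            n++;
        }
    }
    return n;
}
```

When reading damaged data, a decoder can skip bytes for which utf8_sequence_length() returns 0 until the next start byte is found and resume decoding from there.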

Using UTF-8, the characters commonly used in the western world are automatically stored with fewer bytes, resulting in smaller files. However, the much more important advantage of this encoding is that existing ASCII files can be interpreted directly as UTF-8 files. For this reason, UTF-8 has spread worldwide and indirectly helped the spread of Unicode.

Windows Encoding

Windows used to use different encodings depending on the language region. The most common was Windows-1252 (Western European), an extension of the ISO-8859-1 encoding, better known as Latin-1. Windows-1252 and ISO-8859-1 differ in the characters 128 - 159, for which ISO-8859-1 defines no printable characters. Since these characters are rarely used, the two encodings were often confused.

The specification ISO-8859-1, or Latin-1, is far more common today than Windows-1252, and its assignment of the characters 128 - 255 was even adopted by Unicode. Nevertheless, many files are still encoded in Windows-1252, which is why, for example, the HTML5 standard requires that content labelled as ISO-8859-1 be interpreted as Windows-1252.
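Since Unicode adopted the Latin-1 assignment for the characters 128 - 255, converting Latin-1 text to UTF-8 needs no lookup table: each byte value is already the Unicode code point. A minimal sketch, with a function name chosen for this example:

```c
#include <stddef.h>

/* Converts len bytes of ISO-8859-1 text in src to UTF-8 in dst.
   dst must have room for 2 * len bytes. Returns the bytes written. */
size_t latin1_to_utf8(const unsigned char *src, size_t len,
                      unsigned char *dst)
{
    size_t out = 0;
    for (size_t i = 0; i < len; i++) {
        unsigned char c = src[i];
        if (c < 0x80) {
            dst[out++] = c;                                   /* ASCII */
        } else {
            dst[out++] = (unsigned char)(0xC0 | (c >> 6));    /* 110xxxxx */
            dst[out++] = (unsigned char)(0x80 | (c & 0x3F));  /* 10xxxxxx */
        }
    }
    return out;
}
```

This shortcut does not hold for Windows-1252: its characters 128 - 159 map to code points elsewhere in Unicode and would need a separate table.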

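A distinction is also made between the character set name ISO-8859-1 and the standard ISO 8859-1 (no hyphen after ISO). The standard itself leaves the characters 128 - 159 undefined, whereas the character set ISO-8859-1 fills them with additional control characters. Neither defines printable characters for this range.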

MacOS Encoding

The Macintosh operating system prior to Mac OS X used an encoding known as MacRoman. Its character table can be looked up in other sources. With Mac OS X, this encoding was gradually superseded by Unicode. Nowadays, files are generally saved in UTF-8.