Bits, Bytes, msb, Endianness

This page was translated by a robot.

In C and C++, values are stored in one or more bytes, and a byte typically consists of 8 bits today. Each bit that belongs to a value is implicitly assigned a significance. The more significant bits are called the most significant bits (msb) and the less significant bits are called the least significant bits (lsb). The arrangement of the bits and bytes required for a value can vary depending on the processor or data format and is described by the so-called endianness (pronounced: endian-ness). Today, mainly two arrangements are in use: big endian and little endian.

Details

The smallest unit of information that a computer can process today is the bit (a contraction of binary digit), which can have two states, known as 0 and 1. If 8 bits are considered as a unit, 256 different combinations (states) can be formed. On today's processors, these 8 bits form the basic unit, the byte. This size is assumed below. If several bytes are lined up, correspondingly more states can be represented: n bits allow 2^n states.

In today's computers, the byte defines the standard size for addressing. Addresses are treated as non-negative integers: the first byte has the address 0, the next the address 1, etc. Each value that consists of exactly one byte can be addressed directly with a unique address. For values consisting of several bytes (multi-byte), the address of the first byte (the one with the lowest address) is used.
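The addressing rule above can be made visible with a short sketch. The helper name byte_address is made up for this illustration; it simply returns the address of a single byte inside a multi-byte value, counted from the lowest address.

```c
#include <stddef.h>

/* Hypothetical helper: returns the address of byte number index inside
   the value that starts at the address value. Byte 0 is the byte with
   the lowest address, which is also the address of the whole value. */
const unsigned char* byte_address(const void* value, size_t index){
  return (const unsigned char*)value + index;
}
```

Calling byte_address(&x, 0) for some variable x yields the same address as &x, and each following byte lies exactly one address higher.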

All values in a computer are stored using bits. The representation of a value using bits is called a binary value. In the following, a distinction is made between the representation and the storage of a binary value.

Representation of Binary Values

When representing a binary value, the bits of a value are usually listed from left to right with decreasing significance. This reflects the human representation of decimal numbers, where more significant digits are also further to the left and less significant digits are further to the right.

For example, if a single byte is treated as a positive integer (that is, as an unsigned char in C and C++), the bits can be listed as follows:

Bit #    7     6     5     4     3     2     1     0
----------------------------------------------------
       2^7   2^6   2^5   2^4   2^3   2^2   2^1   2^0
       128    64    32    16     8     4     2     1

Here the bit with the number 7 is referred to as the most significant bit. Bit number 0 is called the least significant bit. The terms are abbreviated to msb and lsb, all in lowercase (uppercase is used for the equivalent abbreviations with bytes, see below).
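The listing above can be reproduced in code. The following is a minimal sketch that assumes 8-bit bytes, as this page does; the function names byte_to_bits and sum_of_significances are made up for this illustration.

```c
/* Writes the 8 bits of value into out, most significant bit (bit 7)
   first, followed by a terminating '\0'. out must hold 9 chars. */
void byte_to_bits(unsigned char value, char out[9]){
  for(int bit = 7; bit >= 0; bit--){
    /* Shift the wanted bit down to position 0 and mask it out. */
    out[7 - bit] = ((value >> bit) & 1u) ? '1' : '0';
  }
  out[8] = '\0';
}

/* Adds up the significance 2^bit of every bit that is set.
   The result equals the value itself. */
unsigned int sum_of_significances(unsigned char value){
  unsigned int sum = 0;
  for(int bit = 0; bit < 8; bit++){
    if((value >> bit) & 1u){
      sum += 1u << bit;
    }
  }
  return sum;
}
```

For the value 133 = 128 + 4 + 1, byte_to_bits produces the string 10000101: bits 7, 2 and 0 are set.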

As another example, if a positive integer value consists of two bytes, the lsb is still bit number 0, but the msb is number 15.

Bit #     15     14     13     12   ...   3    2    1    0
----------------------------------------------------------
        2^15   2^14   2^13   2^12       2^3  2^2  2^1  2^0
       32768  16384   8192   4096         8    4    2    1

This list could be continued indefinitely for even wider integer types. As the example shows, a different bit can be the most significant bit depending on the type of a value.
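How the number of the msb follows from the size of a type can be sketched in one line. The helper name msb_number is hypothetical; CHAR_BIT is the standard macro for the number of bits per byte.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical helper: number of the most significant bit of a value
   that occupies size bytes. Bit numbering starts at 0 for the lsb. */
size_t msb_number(size_t size){
  return size * CHAR_BIT - 1;
}
```

With 8-bit bytes, msb_number(1) gives 7 and msb_number(2) gives 15, matching the two listings above.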

The terms msb and lsb are also often used in the plural. Here, for example, most significant bits designate a certain number of bits at the left edge of a value. However, such a designation is usually only a figure of speech, with the exact number of bits being clear from the context. For example, the shift-right operator fills the vacated most significant bits after the shift with 0 or 1.
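The shift-right behavior mentioned above can be tried out with a small sketch. The function names are made up for this illustration. Note the assumption stated in the comments: for negative signed values, the C standard leaves the result of a right shift to the implementation.

```c
#include <stdint.h>

/* For unsigned values, the vacated most significant bits are always
   filled with 0. count is expected to be between 0 and 7. */
uint8_t shift_right_unsigned(uint8_t value, unsigned int count){
  return (uint8_t)(value >> count);
}

/* For negative signed values, the result is implementation-defined.
   Most compilers copy the sign bit, which fills the msbs with 1. */
int8_t shift_right_signed(int8_t value, unsigned int count){
  return (int8_t)(value >> count);
}
```

Shifting the unsigned value 0xF0 (binary 11110000) right by 2 yields 0x3C (binary 00111100): the two msbs were filled with 0.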

It should be noted that for signed values, the most significant bit serves as the sign bit. More information about this can be found under the encoding of integers.

Storage of Binary Values

When storing binary values, the bits are not necessarily arranged in the way that would be usual when representing binary values. During the development phase of modern computers, several different processor architectures were introduced, which processed values in different bit orders. This is because, depending on the arrangement of the bits, the linking of transistors on the chip would have been more or less complicated and calculations could have been faster or slower depending on the situation. Today such effects play a minor role (or none at all), but the design decisions from back then have survived to this day.

Nowadays a programmer generally (exceptions are rare) only has to distinguish between two arrangements. They differ in the order of the bytes in multi-byte values and are known as big endian and little endian. The specification of the order in which a value is stored is called endianness. There is no explanation on this page of what this term has to do with eggs.

As with bits, each byte of a multi-byte value can be implicitly assigned a significance. The more significant bytes are called the most significant bytes and the less significant bytes are called the least significant bytes. The abbreviations MSB and LSB are also used for this, but here with capital letters to make it clear that bytes and not bits are involved.

Big endian means most significant byte first and little endian means least significant byte first. The 8-byte value 0xfedcba9876543210 is given as an example:

Address    0     1     2     3     4     5     6     7
------------------------------------------------------
Big       fe    dc    ba    98    76    54    32    10
Little    10    32    54    76    98    ba    dc    fe

Big endian basically corresponds to the representation of a binary value as shown above. The bytes are listed from left to right (ascending addresses) with decreasing significance. Little endian lists the bytes with increasing significance. The order of the bytes is thus exactly reversed between the two byte orders. However, it should be noted that the bits within the individual bytes do not change.
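The two rows of the table can be produced independently of the processor by extracting the bytes with shifts. This is a minimal sketch, assuming 8-bit bytes; the function names store_big64 and store_little64 are made up for this illustration.

```c
#include <stdint.h>

/* Stores value with the most significant byte at the lowest address. */
void store_big64(uint64_t value, unsigned char out[8]){
  for(int i = 0; i < 8; i++){
    out[i] = (unsigned char)((value >> (8 * (7 - i))) & 0xff);
  }
}

/* Stores value with the least significant byte at the lowest address. */
void store_little64(uint64_t value, unsigned char out[8]){
  for(int i = 0; i < 8; i++){
    out[i] = (unsigned char)((value >> (8 * i)) & 0xff);
  }
}
```

For the value 0xfedcba9876543210, store_big64 writes fe at index 0 and 10 at index 7, while store_little64 writes 10 at index 0 and fe at index 7.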

It is easy to imagine what happens when a big-endian value is accidentally interpreted as a little-endian value: the big-endian bytes of 0x76543210, read as little endian, yield 0x10325476. However, such confusion can only occur if the same data is used on different processors without conversion. When binary files are exchanged between two computers, either the running program must know the endianness with which the data is to be saved or read, or the data format itself must define the endianness with which the data is saved. There are data formats that require values to be in little endian only, and there are other data formats that require values to be in big endian only. On the other hand, there are also data formats that use a flag in the file to define how the bytes are stored. However, many (quick and dirty) data formats do not define anything at all, and so it can always happen that data is misinterpreted.
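Reading a value from a format that states its own byte order might look like the following sketch. The function name read_u32 and the flag parameter is_big_endian are assumptions for this illustration and do not belong to any particular file format. Because the value is assembled with shifts, the result does not depend on the endianness of the machine that runs the code.

```c
#include <stdint.h>

/* Assembles a 32-bit value from 4 bytes that were stored in the byte
   order named by is_big_endian (nonzero: big, zero: little). */
uint32_t read_u32(const unsigned char bytes[4], int is_big_endian){
  uint32_t value = 0;
  for(int i = 0; i < 4; i++){
    /* Big endian: first byte is the most significant one. */
    int shift = is_big_endian ? 8 * (3 - i) : 8 * i;
    value |= (uint32_t)bytes[i] << shift;
  }
  return value;
}
```

The bytes 76 54 32 10 give 0x76543210 when read as big endian and 0x10325476 when read as little endian, which is exactly the kind of misinterpretation described above.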

Conversion between Little and Big

A few lines of code are enough to convert back and forth between the two byte orders most commonly used today. As an example, the author gives an excerpt of the corresponding code for 64-bit values from NALib:

NA_IAPI void naConvertLittleBig64(void* buffer){
  naSwap8(((NAByte*)buffer)+7, ((NAByte*)buffer)+0);
  naSwap8(((NAByte*)buffer)+6, ((NAByte*)buffer)+1);
  naSwap8(((NAByte*)buffer)+5, ((NAByte*)buffer)+2);
  naSwap8(((NAByte*)buffer)+4, ((NAByte*)buffer)+3);
}

For the standard types with 16, 32, 64 and perhaps 128 bits, these functions are easy to write out. Types with 8 bits do not need to be converted, since a conversion between little and big endian only affects multi-byte values.
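For comparison, here is a sketch of the same idea for 16-bit and 32-bit values, written with shifts and masks instead of byte swaps in a buffer. This is not the NALib implementation; the names swap16 and swap32 are made up for this illustration.

```c
#include <stdint.h>

/* Exchanges the two bytes of a 16-bit value. */
uint16_t swap16(uint16_t v){
  return (uint16_t)((v >> 8) | (v << 8));
}

/* Reverses the order of the four bytes of a 32-bit value. */
uint32_t swap32(uint32_t v){
  return ((v >> 24) & 0x000000ffu)
       | ((v >>  8) & 0x0000ff00u)
       | ((v <<  8) & 0x00ff0000u)
       | ((v << 24) & 0xff000000u);
}
```

The same function converts in both directions: applying swap32 twice returns the original value.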

The conversion between the two byte orders is thus not very difficult. It is considerably harder to find out which endianness the current system actually has. Ideally, this information is already known at compile time, and in principle it can be determined for any system using predefined macros. Unfortunately, there is a plethora of system combinations, which is why this automatic detection is laborious and rarely used. Nonetheless, NALib, for example, uses a few macros to do just that.
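As a rough idea of what such a compile-time detection can look like, the following sketch relies on the macros __BYTE_ORDER__, __ORDER_LITTLE_ENDIAN__ and __ORDER_BIG_ENDIAN__, which GCC and Clang predefine. This is not the set of macros NALib uses, other compilers may not define them, and the HOST_... names are made up for this illustration.

```c
/* Compile-time detection, assuming the GCC/Clang predefined macros.
   If they are missing, the endianness is reported as unknown. */
#if defined(__BYTE_ORDER__) && defined(__ORDER_LITTLE_ENDIAN__) \
    && (__BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__)
  #define HOST_ENDIANNESS_KNOWN 1
  #define HOST_IS_LITTLE_ENDIAN 1
#elif defined(__BYTE_ORDER__) && defined(__ORDER_BIG_ENDIAN__) \
    && (__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__)
  #define HOST_ENDIANNESS_KNOWN 1
  #define HOST_IS_LITTLE_ENDIAN 0
#else
  /* Unknown compiler or unusual byte order: a runtime test is needed. */
  #define HOST_ENDIANNESS_KNOWN 0
  #define HOST_IS_LITTLE_ENDIAN 0
#endif
```

Code that depends on the byte order can then be selected with #if HOST_IS_LITTLE_ENDIAN, so no branch remains in the compiled program.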

Another possibility is to determine the endianness not at compile time, but at runtime. The following example shows a simple way of distinguishing between little and big endian:

#include <stdio.h>

int main(void){
  int test = 0x76543210;
  // Read the byte stored at the lowest address of test.
  unsigned char firstbyte = ((unsigned char*)(&test))[0];
  if(firstbyte == 0x76){
    printf("0x%x: Big Endian\n", firstbyte);
  }else{
    printf("0x%x: Little Endian\n", firstbyte);
  }
  return 0;
}

Output on a little-endian system:

0x10: Little Endian

This method works on any system that uses little or big endian, but not on systems with other byte orders (which, as described above, will hardly concern the reader). It also has the disadvantage that every check causes a branch in the control flow, which can noticeably affect the runtime of the program if there are many calls. However, since an endianness conversion is usually only necessary for input and output, this conversion time is rarely significant. Nevertheless, the author recommends storing the endianness determined in this way in a variable, or even setting function pointers to the corresponding conversions. The branch remains, but the code looks clean.
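The recommendation to set function pointers once could be sketched as follows. All names here (Convert32, native_to_big32, init_conversions and the helpers) are hypothetical and only illustrate the idea; the runtime test is the same one as in the example above.

```c
#include <stdint.h>

typedef uint32_t (*Convert32)(uint32_t);

/* Used on big-endian hosts: nothing to do. */
static uint32_t keep_bytes32(uint32_t v){ return v; }

/* Used on little-endian hosts: reverse the four bytes. */
static uint32_t swap_bytes32(uint32_t v){
  return (v >> 24) | ((v >> 8) & 0x0000ff00u)
       | ((v << 8) & 0x00ff0000u) | (v << 24);
}

/* Runtime test: is the most significant byte stored first? */
static int host_is_big_endian(void){
  uint32_t test = 0x76543210u;
  return ((unsigned char*)&test)[0] == 0x76;
}

/* Set once by init_conversions, then called without further tests. */
static Convert32 native_to_big32 = 0;

void init_conversions(void){
  native_to_big32 = host_is_big_endian() ? keep_bytes32 : swap_bytes32;
}
```

After a single call to init_conversions at program start, native_to_big32(value) returns a value whose bytes lie in memory in big-endian order on either kind of host.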

Furthermore, many systems provide predefined functions (they come from the POSIX sockets API rather than from the C or C++ standard itself) that convert values from the endianness of the system (the so-called native endianness) to big endian and back. This family of functions has no counterpart for little endian. The reason is historical: the protocols of the Internet (which grew out of the so-called ARPANET) define big endian as the network byte order. Thus, all systems had to provide appropriate conversions.

The functions are usually declared in the header arpa/inet.h and are called htons and htonl, with their inverse counterparts ntohs and ntohl. The names stand for Host to Network and Network to Host, for the types short (16 bits) and long (32 bits) respectively.

#include <stdio.h>
#include <stdint.h>
#include <arpa/inet.h>

int main(void){
  uint32_t test = 0x76543210;
  printf("Native: 0x%x\n", (unsigned int)test);
  // htonl converts a 32-bit value from native to big endian.
  printf("Big Endian: 0x%x\n", (unsigned int)htonl(test));
  return 0;
}

Output on a little-endian system:

Native: 0x76543210
Big Endian: 0x10325476

Which method is most useful is up to the programmer. For network communication it is preferable to use the predefined standard functions. For hand-written conversions, both a static method using macro evaluation and a dynamic method at runtime are suitable.
