A Tutorial on Data Representation
Integers, Floating-Point Numbers, and Characters
Number Systems
Human beings use decimal (base 10) and duodecimal (base 12) number systems for counting and measurements (probably because we have 10 fingers and two big toes). Computers use the binary (base 2) number system, as they are made from binary digital components (known as transistors) operating in two states - on and off. In computing, we also use the hexadecimal (base 16) or octal (base 8) number systems, as a compact form for representing binary numbers.
Decimal (Base 10) Number System
The decimal number system has 10 symbols: 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9, called digits. It uses positional notation. That is, the least-significant digit (right-most digit) is of the order of 10^0 (units or ones), the second right-most digit is of the order of 10^1 (tens), the third right-most digit is of the order of 10^2 (hundreds), and so on, where ^ denotes exponent. For example,
735 = 700 + 30 + 5 = 7×10^2 + 3×10^1 + 5×10^0
We shall denote a decimal number with an optional suffix D if ambiguity arises.
Binary (Base 2) Number System
The binary number system has two symbols: 0 and 1, called bits. It is also a positional notation, for example,
10110B = 10000B + 0000B + 100B + 10B + 0B = 1×2^4 + 0×2^3 + 1×2^2 + 1×2^1 + 0×2^0
We shall denote a binary number with a suffix B. Some programming languages denote binary numbers with prefix 0b or 0B (e.g., 0b1001000), or prefix b with the bits quoted (e.g., b'10001111').
A binary digit is called a bit. Eight bits is called a byte (why 8-bit unit? Probably because 8=2^3).
Hexadecimal (Base 16) Number System
The hexadecimal number system uses 16 symbols: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, and F, called hex digits. It is a positional notation, for example,
A3EH = A00H + 30H + EH = 10×16^2 + 3×16^1 + 14×16^0
We shall denote a hexadecimal number (in short, hex) with a suffix H. Some programming languages denote hex numbers with prefix 0x or 0X (e.g., 0x1A3C5F), or prefix x with hex digits quoted (e.g., x'C3A4D98B').
Each hexadecimal digit is also called a hex digit. Most programming languages accept lowercase 'a' to 'f' as well as uppercase 'A' to 'F'.
Computers use the binary system in their internal operations, as they are built from binary digital electronic components with two states - on and off. However, writing or reading a long sequence of binary bits is cumbersome and error-prone (try to read this binary string: 1011 0011 0100 0011 0001 1101 0001 1000B, which is the same as hexadecimal B343 1D18H). The hexadecimal system is used as a compact form or shorthand for binary bits. Each hex digit is equivalent to 4 binary bits, i.e., shorthand for 4 bits, as follows:
| Hexadecimal | Binary | Decimal |
|---|---|---|
| 0 | 0000 | 0 |
| 1 | 0001 | 1 |
| 2 | 0010 | 2 |
| 3 | 0011 | 3 |
| 4 | 0100 | 4 |
| 5 | 0101 | 5 |
| 6 | 0110 | 6 |
| 7 | 0111 | 7 |
| 8 | 1000 | 8 |
| 9 | 1001 | 9 |
| A | 1010 | 10 |
| B | 1011 | 11 |
| C | 1100 | 12 |
| D | 1101 | 13 |
| E | 1110 | 14 |
| F | 1111 | 15 |
Conversion from Hexadecimal to Binary
Replace each hex digit by the 4 equivalent bits (as listed in the above table), for example,
A3C5H = 1010 0011 1100 0101B
102AH = 0001 0000 0010 1010B
Conversion from Binary to Hexadecimal
Starting from the right-most bit (least-significant bit), replace each group of 4 bits by the equivalent hex digit (pad the left-most bits with zeros if necessary), for example,
1001001010B = 0010 0100 1010B = 24AH
10001011001011B = 0010 0010 1100 1011B = 22CBH
It is important to note that the hexadecimal number provides a compact form or shorthand for representing binary bits.
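The two table-lookup conversions above can be sketched in Java. This is a minimal illustration; the class and method names are my own, not part of any standard API:

```java
public class HexBin {
    // Hex -> binary: expand each hex digit into its 4-bit group.
    static String hexToBin(String hex) {
        StringBuilder sb = new StringBuilder();
        for (char c : hex.toCharArray()) {
            String bits = Integer.toBinaryString(Character.digit(c, 16));
            sb.append("0000".substring(bits.length())).append(bits); // pad to 4 bits
        }
        return sb.toString();
    }

    // Binary -> hex: pad on the left to a multiple of 4 bits, then map each group.
    static String binToHex(String bin) {
        int pad = (4 - bin.length() % 4) % 4;
        String s = "0".repeat(pad) + bin;
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i += 4) {
            int group = Integer.parseInt(s.substring(i, i + 4), 2);
            sb.append(Character.toUpperCase(Character.forDigit(group, 16)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hexToBin("A3C5"));       // 1010001111000101
        System.out.println(binToHex("1001001010")); // 24A
    }
}
```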
Conversion from Base r to Decimal (Base 10)
Given an n-digit base r number: dn-1 dn-2 dn-3 ... d2 d1 d0 (base r), the decimal equivalent is given by:
dn-1×r^(n-1) + dn-2×r^(n-2) + ... + d1×r^1 + d0×r^0
For example,
A1C2H = 10×16^3 + 1×16^2 + 12×16^1 + 2 = 41410 (base 10)
10110B = 1×2^4 + 1×2^2 + 1×2^1 = 22 (base 10)
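The positional formula can be evaluated with Horner's method, and Java's built-in `Integer.parseInt(String, int)` performs the same base-r-to-decimal conversion. A minimal sketch (`toDecimal` is my own helper name):

```java
public class BaseToDec {
    // Horner's method: ((d2*r + d1)*r + d0)... expands to the positional formula.
    static int toDecimal(String digits, int r) {
        int value = 0;
        for (char c : digits.toCharArray()) {
            value = value * r + Character.digit(c, r);
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(toDecimal("A1C2", 16));        // 41410
        System.out.println(Integer.parseInt("10110", 2)); // 22, using the built-in parser
    }
}
```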
Conversion from Decimal (Base 10) to Base r
Use repeated division/remainder. For example,
To convert 261 (base 10) to hexadecimal:
261/16 => quotient=16 remainder=5
16/16 => quotient=1 remainder=0
1/16 => quotient=0 remainder=1 (quotient=0 stop)
Hence, 261D = 105H (Collect the hex digits from the remainders in reverse order)
The above procedure is actually applicable to conversion between any two base systems. For example,
To convert 1023 (base 4) to base 3:
1023 (base 4) / 3 => quotient=25D remainder=0
25D/3 => quotient=8D remainder=1
8D/3 => quotient=2D remainder=2
2D/3 => quotient=0 remainder=2 (quotient=0 stop)
Hence, 1023 (base 4) = 2210 (base 3)
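The repeated division/remainder procedure above can be sketched in Java as follows (`fromDecimal` is my own helper name; the built-in `Integer.toString(int, radix)` does the same job):

```java
public class DecToBase {
    // Repeatedly divide by the target radix; the remainders, read in
    // reverse order, are the digits of the result.
    static String fromDecimal(int value, int r) {
        if (value == 0) return "0";
        StringBuilder sb = new StringBuilder();
        while (value > 0) {
            sb.append(Character.toUpperCase(Character.forDigit(value % r, r)));
            value /= r;
        }
        return sb.reverse().toString();
    }

    public static void main(String[] args) {
        System.out.println(fromDecimal(261, 16));                        // 105
        // 1023 (base 4) -> decimal first, then decimal -> base 3:
        System.out.println(fromDecimal(Integer.parseInt("1023", 4), 3)); // 2210
    }
}
```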
Conversion between Two Number Systems with Fractional Part
- Separate the integral and the fractional parts.
- For the integral part, divide by the target radix repeatedly, and collect the remainders in reverse order.
- For the fractional part, multiply the fractional part by the target radix repeatedly, and collect the integral parts in the same order.
Example 1: Decimal to Binary
Convert 18.6875D to binary
Integral Part = 18D
18/2 => quotient=9 remainder=0
9/2 => quotient=4 remainder=1
4/2 => quotient=2 remainder=0
2/2 => quotient=1 remainder=0
1/2 => quotient=0 remainder=1 (quotient=0 stop)
Hence, 18D = 10010B
Fractional Part = .6875D
.6875*2=1.375 => whole number is 1
.375*2=0.75 => whole number is 0
.75*2=1.5 => whole number is 1
.5*2=1.0 => whole number is 1
Hence .6875D = .1011B
Combining, 18.6875D = 10010.1011B
Example 2: Decimal to Hexadecimal
Convert 18.6875D to hexadecimal
Integral Part = 18D
18/16 => quotient=1 remainder=2
1/16 => quotient=0 remainder=1 (quotient=0 stop)
Hence, 18D = 12H
Fractional Part = .6875D
.6875*16=11.0 => whole number is 11D (BH)
Hence .6875D = .BH
Combining, 18.6875D = 12.BH
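The repeated-multiplication rule for the fractional part can be sketched in Java (`fracToBase` is my own helper name). Note this minimal illustration uses double arithmetic, which is exact here only because 0.6875 = 11/16 is representable in binary; the `maxDigits` cap guards against non-terminating fractions such as 0.1D:

```java
public class FracConv {
    // Multiply the fractional part by the target radix repeatedly and
    // collect the integral digits in order.
    static String fracToBase(double frac, int r, int maxDigits) {
        StringBuilder sb = new StringBuilder(".");
        for (int i = 0; i < maxDigits && frac != 0.0; i++) {
            frac *= r;
            int digit = (int) frac;       // the "whole number" produced this step
            sb.append(Character.toUpperCase(Character.forDigit(digit, r)));
            frac -= digit;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(fracToBase(0.6875, 2, 16));  // .1011
        System.out.println(fracToBase(0.6875, 16, 16)); // .B
    }
}
```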
Exercises (Number Systems Conversion)
- Convert the following decimal numbers into binary and hexadecimal numbers:
  - 108
  - 4848
  - 9000
- Convert the following binary numbers into hexadecimal and decimal numbers:
  - 1000011000
  - 10000000
  - 101010101010
- Convert the following hexadecimal numbers into binary and decimal numbers:
  - ABCDE
  - 1234
  - 80F
- Convert the following decimal numbers into their binary equivalents:
  - 19.25D
  - 123.456D
Answers: You could use the Windows Calculator (calc.exe) to carry out number system conversion, by setting it to the Programmer or Scientific mode. (Run "calc" ⇒ Select "Settings" menu ⇒ Choose "Programmer" or "Scientific" mode.)
- 1101100B, 1001011110000B, 10001100101000B, 6CH, 12F0H, 2328H.
- 218H, 80H, AAAH, 536D, 128D, 2730D.
- 10101011110011011110B, 1001000110100B, 100000001111B, 703710D, 4660D, 2063D.
- ?? (You work it out!)
Computer Memory & Data Representation
Computers use a fixed number of bits to represent a piece of data, which could be a number, a character, or others. An n-bit storage location can represent up to 2^n distinct entities. For example, a 3-bit memory location can hold one of these eight binary patterns: 000, 001, 010, 011, 100, 101, 110, or 111. Hence, it can represent at most 8 distinct entities. You could use them to represent numbers 0 to 7, numbers 8881 to 8888, characters 'A' to 'H', or up to 8 kinds of fruits like apple, orange, banana; or up to 8 kinds of animals like lion, tiger, etc.
Integers, for example, can be represented in 8-bit, 16-bit, 32-bit or 64-bit. You, as the programmer, choose an appropriate bit-length for your integers. Your choice will impose constraints on the range of integers that can be represented. Besides bit-length, an integer can be represented in various representation schemes, e.g., unsigned vs. signed integers. An 8-bit unsigned integer has a range of 0 to 255, while an 8-bit signed integer has a range of -128 to 127 - both representing 256 distinct numbers.
It is important to note that a computer memory location merely stores a binary pattern. It is entirely up to you, as the programmer, to decide on how these patterns are to be interpreted. For example, the 8-bit binary pattern "0100 0001B" can be interpreted as an unsigned integer 65, or an ASCII character 'A', or some secret information known only to you. In other words, you have to first decide how to represent a piece of data in a binary pattern before the binary patterns make sense. The interpretation of binary patterns is called data representation or encoding. Furthermore, it is important that the data representation schemes are agreed upon by all the parties, i.e., industrial standards need to be formulated and strictly followed.
Once you decide on the data representation scheme, certain constraints, in particular the precision and range, will be imposed. Hence, it is important to understand data representation to write correct and high-performance programs.
Rosetta Stone and the Decipherment of Egyptian Hieroglyphs
Egyptian hieroglyphs were used by the ancient Egyptians since 4000BC. Unfortunately, since 500AD, no one could any longer read the ancient Egyptian hieroglyphs, until the re-discovery of the Rosetta Stone in 1799 by Napoleon's troops (during Napoleon's Egyptian invasion) near the town of Rashid (Rosetta) in the Nile Delta.
The Rosetta Stone is inscribed with a decree in 196BC on behalf of King Ptolemy V. The decree appears in three scripts: the upper text is Ancient Egyptian hieroglyphs, the middle portion Demotic script, and the lowest Ancient Greek. Because it presents essentially the same text in all three scripts, and Ancient Greek could still be understood, it provided the key to the decipherment of the Egyptian hieroglyphs.
The moral of the story is that unless you know the encoding scheme, there is no way that you can decode the data.
Reference and images: Wikipedia.
Integer Representation
Integers are whole numbers or fixed-point numbers with the radix point fixed after the least-significant bit. They are in contrast to real numbers or floating-point numbers, where the position of the radix point varies. It is important to take note that integers and floating-point numbers are treated differently in computers. They have different representations and are processed differently (e.g., floating-point numbers are processed in a so-called floating-point processor). Floating-point numbers will be discussed later.
Computers use a fixed number of bits to represent an integer. The commonly-used bit-lengths for integers are 8-bit, 16-bit, 32-bit or 64-bit. Besides bit-lengths, there are two representation schemes for integers:
- Unsigned Integers: can represent zero and positive integers.
- Signed Integers: can represent zero, positive and negative integers. Three representation schemes have been proposed for signed integers:
- Sign-Magnitude representation
- 1's Complement representation
- 2's Complement representation
You, as the programmer, need to decide on the bit-length and representation scheme for your integers, depending on your application's requirements. Suppose that you need a counter for counting a small quantity from 0 up to 200, you might choose the 8-bit unsigned integer scheme as there are no negative numbers involved.
n-bit Unsigned Integers
Unsigned integers can represent zero and positive integers, but not negative integers. The value of an unsigned integer is interpreted as "the magnitude of its underlying binary pattern".
Example 1: Suppose that n=8 and the binary pattern is 0100 0001B, the value of this unsigned integer is 1×2^0 + 1×2^6 = 65D.
Example 2: Suppose that n=16 and the binary pattern is 0001 0000 0000 1000B, the value of this unsigned integer is 1×2^3 + 1×2^12 = 4104D.
Example 3: Suppose that n=16 and the binary pattern is 0000 0000 0000 0000B, the value of this unsigned integer is 0.
An n-bit pattern can represent 2^n distinct integers. An n-bit unsigned integer can represent integers from 0 to (2^n)-1, as tabulated below:
| n | Minimum | Maximum |
|---|---|---|
| 8 | 0 | (2^8)-1 (=255) |
| 16 | 0 | (2^16)-1 (=65,535) |
| 32 | 0 | (2^32)-1 (=4,294,967,295) (9+ digits) |
| 64 | 0 | (2^64)-1 (=18,446,744,073,709,551,615) (19+ digits) |
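In Java, whose `byte` type is signed, the same 8-bit pattern can nevertheless be viewed as an unsigned integer with the JDK method `Byte.toUnsignedInt` (Java 8+). For example:

```java
public class UnsignedDemo {
    public static void main(String[] args) {
        byte b = (byte) 0b0100_0001;
        System.out.println(Byte.toUnsignedInt(b));  // 65

        byte c = (byte) 0b1111_1111;
        System.out.println(c);                      // -1  (Java's signed view)
        System.out.println(Byte.toUnsignedInt(c));  // 255 (unsigned view of the same pattern)
    }
}
```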
Signed Integers
Signed integers can represent zero, positive integers, as well as negative integers. Three representation schemes are available for signed integers:
- Sign-Magnitude representation
- 1's Complement representation
- 2's Complement representation
In all the above three schemes, the most-significant bit (msb) is called the sign bit. The sign bit is used to represent the sign of the integer - with 0 for positive integers and 1 for negative integers. The magnitude of the integer, however, is interpreted differently in different schemes.
n-bit Signed Integers in Sign-Magnitude Representation
In sign-magnitude representation:
- The most-significant bit (msb) is the sign bit, with value of 0 representing positive integers and 1 representing negative integers.
- The remaining n-1 bits represent the magnitude (absolute value) of the integer. The absolute value of the integer is interpreted as "the magnitude of the (n-1)-bit binary pattern".
Example 1: Suppose that n=8 and the binary representation is 0 100 0001B.
Sign bit is 0 ⇒ positive
Absolute value is 100 0001B = 65D
Hence, the integer is +65D
Example 2: Suppose that n=8 and the binary representation is 1 000 0001B.
Sign bit is 1 ⇒ negative
Absolute value is 000 0001B = 1D
Hence, the integer is -1D
Example 3: Suppose that n=8 and the binary representation is 0 000 0000B.
Sign bit is 0 ⇒ positive
Absolute value is 000 0000B = 0D
Hence, the integer is +0D
Example 4: Suppose that n=8 and the binary representation is 1 000 0000B.
Sign bit is 1 ⇒ negative
Absolute value is 000 0000B = 0D
Hence, the integer is -0D
The drawbacks of sign-magnitude representation are:
- There are two representations (0000 0000B and 1000 0000B) for the number zero, which could lead to inefficiency and confusion.
- Positive and negative integers need to be processed separately.
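Sign-magnitude is not the native scheme of Java or of common CPUs, so the following decoder is purely illustrative (the class and method names are my own):

```java
public class SignMagnitude {
    // Interpret an 8-bit pattern (given as an int in 0..255) in sign-magnitude.
    static int decode(int bits) {
        int magnitude = bits & 0x7F;            // the low 7 bits
        // Note: the pattern 1000 0000B is "minus zero"; -0 collapses to 0 in an int.
        return (bits & 0x80) != 0 ? -magnitude : magnitude;
    }

    public static void main(String[] args) {
        System.out.println(decode(0b0100_0001)); // 65
        System.out.println(decode(0b1000_0001)); // -1
    }
}
```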
n-bit Signed Integers in 1's Complement Representation
In 1's complement representation:
- Again, the most significant bit (msb) is the sign bit, with value of 0 representing positive integers and 1 representing negative integers.
- The remaining n-1 bits represent the magnitude of the integer, as follows:
- for positive integers, the absolute value of the integer is equal to "the magnitude of the (n-1)-bit binary pattern".
- for negative integers, the absolute value of the integer is equal to "the magnitude of the complement (inverse) of the (n-1)-bit binary pattern" (hence called 1's complement).
Example 1: Suppose that n=8 and the binary representation is 0 100 0001B.
Sign bit is 0 ⇒ positive
Absolute value is 100 0001B = 65D
Hence, the integer is +65D
Example 2: Suppose that n=8 and the binary representation is 1 000 0001B.
Sign bit is 1 ⇒ negative
Absolute value is the complement of 000 0001B, i.e., 111 1110B = 126D
Hence, the integer is -126D
Example 3: Suppose that n=8 and the binary representation is 0 000 0000B.
Sign bit is 0 ⇒ positive
Absolute value is 000 0000B = 0D
Hence, the integer is +0D
Example 4: Suppose that n=8 and the binary representation is 1 111 1111B.
Sign bit is 1 ⇒ negative
Absolute value is the complement of 111 1111B, i.e., 000 0000B = 0D
Hence, the integer is -0D
Again, the drawbacks are:
- There are two representations (0000 0000B and 1111 1111B) for zero.
- Positive and negative integers need to be processed separately.
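Similarly, a 1's complement decoder can be sketched for illustration (again, this is not a built-in scheme; the names are my own):

```java
public class OnesComplement {
    // Interpret an 8-bit pattern (given as an int in 0..255) in 1's complement.
    static int decode(int bits) {
        if ((bits & 0x80) == 0) return bits;  // positive: the pattern is the value
        return -((~bits) & 0x7F);             // negative: invert the low 7 bits
    }

    public static void main(String[] args) {
        System.out.println(decode(0b0100_0001)); // 65
        System.out.println(decode(0b1000_0001)); // -126
        System.out.println(decode(0b1111_1111)); // 0 (the "minus zero" pattern)
    }
}
```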
n-bit Signed Integers in 2's Complement Representation
In 2's complement representation:
- Again, the most significant bit (msb) is the sign bit, with value of 0 representing positive integers and 1 representing negative integers.
- The remaining n-1 bits represent the magnitude of the integer, as follows:
- for positive integers, the absolute value of the integer is equal to "the magnitude of the (n-1)-bit binary pattern".
- for negative integers, the absolute value of the integer is equal to "the magnitude of the complement of the (n-1)-bit binary pattern plus one" (hence called 2's complement).
Example 1: Suppose that n=8 and the binary representation is 0 100 0001B.
Sign bit is 0 ⇒ positive
Absolute value is 100 0001B = 65D
Hence, the integer is +65D
Example 2: Suppose that n=8 and the binary representation is 1 000 0001B.
Sign bit is 1 ⇒ negative
Absolute value is the complement of 000 0001B plus 1, i.e., 111 1110B + 1B = 127D
Hence, the integer is -127D
Example 3: Suppose that n=8 and the binary representation is 0 000 0000B.
Sign bit is 0 ⇒ positive
Absolute value is 000 0000B = 0D
Hence, the integer is +0D
Example 4: Suppose that n=8 and the binary representation is 1 111 1111B.
Sign bit is 1 ⇒ negative
Absolute value is the complement of 111 1111B plus 1, i.e., 000 0000B + 1B = 1D
Hence, the integer is -1D
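Java's integer types are themselves 2's complement, so casting an 8-bit pattern to `byte` reproduces these examples directly:

```java
public class TwosComplement {
    public static void main(String[] args) {
        // Casting to byte re-interprets the low 8 bits in 2's complement.
        System.out.println((byte) 0b0100_0001); // 65
        System.out.println((byte) 0b1000_0001); // -127
        System.out.println((byte) 0b1111_1111); // -1
    }
}
```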
Computers use 2's Complement Representation for Signed Integers
We have discussed three representations for signed integers: sign-magnitude, 1's complement and 2's complement. Computers use 2's complement in representing signed integers. This is because:
- There is only one representation for the number zero in 2's complement, instead of two representations in sign-magnitude and 1's complement.
- Positive and negative integers can be treated together in addition and subtraction. Subtraction can be carried out using the "addition logic".
Example 1: Addition of Two Positive Integers: Suppose that n=8, 65D + 5D = 70D
 65D → 0100 0001B
  5D → 0000 0101B (+)
       0100 0110B → 70D (OK)
Example 2: Subtraction is treated as Addition of a Positive and a Negative Integer: Suppose that n=8, 65D - 5D = 65D + (-5D) = 60D
 65D → 0100 0001B
 -5D → 1111 1011B (+)
       0011 1100B → 60D (discard carry - OK)
Example 3: Addition of Two Negative Integers: Suppose that n=8, -65D - 5D = (-65D) + (-5D) = -70D
-65D → 1011 1111B
 -5D → 1111 1011B (+)
       1011 1010B → -70D (discard carry - OK)
Because of the fixed precision (i.e., fixed number of bits), an n-bit 2's complement signed integer has a certain range. For example, for n=8, the range of 2's complement signed integers is -128 to +127. During addition (and subtraction), it is important to check whether the result exceeds this range, in other words, whether overflow or underflow has occurred.
Example 4: Overflow: Suppose that n=8, 127D + 2D = 129D (overflow - beyond the range)
127D → 0111 1111B
  2D → 0000 0010B (+)
       1000 0001B → -127D (wrong)
Example 5: Underflow: Suppose that n=8, -125D - 5D = -130D (underflow - below the range)
-125D → 1000 0011B
  -5D → 1111 1011B (+)
        0111 1110B → +126D (wrong)
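The two wrap-around cases can be reproduced in Java, since its `byte` type is an 8-bit 2's complement integer. (For `int` arithmetic, the JDK method `Math.addExact` throws an `ArithmeticException` on overflow instead of silently wrapping.)

```java
public class OverflowDemo {
    public static void main(String[] args) {
        byte a = 127, b = 2;
        // 127 + 2 = 129 does not fit in 8 bits; the pattern 1000 0001B
        // is re-interpreted as -127.
        System.out.println((byte) (a + b));  // -127

        byte c = -125, d = -5;
        // -125 - 5 = -130 is below the range; the result wraps to +126.
        System.out.println((byte) (c + d));  // 126
    }
}
```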
The following diagram explains how the 2's complement works. By re-arranging the number line, values from -128 to +127 are represented contiguously by ignoring the carry bit.
Range of n-bit 2's Complement Signed Integers
An n-bit 2's complement signed integer can represent integers from -2^(n-1) to +2^(n-1)-1, as tabulated. Take note that the scheme can represent all the integers within the range, without any gap. In other words, there is no missing integer within the supported range.
| n | Minimum | Maximum |
|---|---|---|
| 8 | -(2^7) (=-128) | +(2^7)-1 (=+127) |
| 16 | -(2^15) (=-32,768) | +(2^15)-1 (=+32,767) |
| 32 | -(2^31) (=-2,147,483,648) | +(2^31)-1 (=+2,147,483,647) (9+ digits) |
| 64 | -(2^63) (=-9,223,372,036,854,775,808) | +(2^63)-1 (=+9,223,372,036,854,775,807) (18+ digits) |
Decoding 2's Complement Numbers
- Check the sign bit (denoted as S).
- If S=0, the number is positive and its absolute value is the binary value of the remaining n-1 bits.
- If S=1, the number is negative. You could "invert the n-1 bits and plus 1" to get the absolute value of the negative number.
  Alternatively, you could scan the remaining n-1 bits from the right (least-significant bit). Look for the first occurrence of 1. Flip all the bits to the left of that first occurrence of 1. The flipped pattern gives the absolute value. For example,
  n = 8, bit pattern = 1 100 0100B
  S = 1 → negative
  Scanning from the right and flipping all the bits to the left of the first occurrence of 1 ⇒ 011 1100B = 60D
  Hence, the value is -60D
Big Endian vs. Little Endian
Modern computers store one byte of data in each memory address or location, i.e., byte-addressable memory. A 32-bit integer is, therefore, stored in 4 memory addresses.
The term "Endian" refers to the order of storing bytes in computer memory. In the "Big Endian" scheme, the most significant byte is stored first, in the lowest memory address (or big end first), while "Little Endian" stores the least significant byte in the lowest memory address.
For example, the 32-bit integer 12345678H (305419896D) is stored as 12H 34H 56H 78H in big endian; and 78H 56H 34H 12H in little endian. The 16-bit integer 00H 01H is interpreted as 0001H in big endian, and 0100H in little endian.
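This can be observed in Java with `java.nio.ByteBuffer`, whose byte order is selectable via `ByteOrder`:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class EndianDemo {
    public static void main(String[] args) {
        int value = 0x12345678;
        ByteBuffer big = ByteBuffer.allocate(4).order(ByteOrder.BIG_ENDIAN).putInt(value);
        ByteBuffer little = ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN).putInt(value);
        // Index 0 corresponds to the lowest memory address.
        System.out.printf("big:    %02X %02X %02X %02X%n",
                big.get(0), big.get(1), big.get(2), big.get(3));
        System.out.printf("little: %02X %02X %02X %02X%n",
                little.get(0), little.get(1), little.get(2), little.get(3));
    }
}
```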
Exercises (Integer Representation)
- What are the ranges of 8-bit, 16-bit, 32-bit and 64-bit integers, in "unsigned" and "signed" representations?
- Give the values of 88, 0, 1, 127, and 255 in 8-bit unsigned representation.
- Give the values of +88, -88, -1, 0, +1, -128, and +127 in 8-bit 2's complement signed representation.
- Give the values of +88, -88, -1, 0, +1, -127, and +127 in 8-bit sign-magnitude representation.
- Give the values of +88, -88, -1, 0, +1, -127, and +127 in 8-bit 1's complement representation.
- [TODO] more.
Answers
- The range of unsigned n-bit integers is [0, 2^n - 1]. The range of n-bit 2's complement signed integers is [-2^(n-1), +2^(n-1)-1].
- 88 (0101 1000), 0 (0000 0000), 1 (0000 0001), 127 (0111 1111), 255 (1111 1111).
- +88 (0101 1000), -88 (1010 1000), -1 (1111 1111), 0 (0000 0000), +1 (0000 0001), -128 (1000 0000), +127 (0111 1111).
- +88 (0101 1000), -88 (1101 1000), -1 (1000 0001), 0 (0000 0000 or 1000 0000), +1 (0000 0001), -127 (1111 1111), +127 (0111 1111).
- +88 (0101 1000), -88 (1010 0111), -1 (1111 1110), 0 (0000 0000 or 1111 1111), +1 (0000 0001), -127 (1000 0000), +127 (0111 1111).
Floating-Point Number Representation
A floating-point number (or real number) can represent a very large value (1.23×10^88) or a very small value (1.23×10^-88). It could also represent a very large negative number (-1.23×10^88) and a very small negative number (-1.23×10^-88), as well as zero, as illustrated:
A floating-point number is typically expressed in scientific notation, with a fraction (F), and an exponent (E) of a certain radix (r), in the form of F×r^E. Decimal numbers use radix 10 (F×10^E); while binary numbers use radix 2 (F×2^E).
Representation of floating-point numbers is not unique. For example, the number 55.66 can be represented as 5.566×10^1, 0.5566×10^2, 0.05566×10^3, and so on. The fractional part can be normalized. In the normalized form, there is only a single non-zero digit before the radix point. For example, the decimal number 123.4567 can be normalized as 1.234567×10^2; the binary number 1010.1011B can be normalized as 1.0101011B×2^3.
It is important to note that floating-point numbers suffer from loss of precision when represented with a fixed number of bits (e.g., 32-bit or 64-bit). This is because there are infinitely many real numbers (even within a small range, say 0.0 to 0.1). On the other hand, an n-bit binary pattern can represent a finite 2^n distinct numbers. Hence, not all real numbers can be represented. The nearest approximation will be used instead, resulting in loss of accuracy.
It is also important to note that floating-point arithmetic is much less efficient than integer arithmetic. It could be sped up with a so-called dedicated floating-point co-processor. Hence, use integers if your application does not require floating-point numbers.
In computers, floating-point numbers are represented in scientific notation of fraction (F) and exponent (E) with a radix of 2, in the form of F×2^E. Both E and F can be positive as well as negative. Modern computers adopt the IEEE 754 standard for representing floating-point numbers. There are two representation schemes: 32-bit single-precision and 64-bit double-precision.
IEEE-754 32-bit Single-Precision Floating-Point Numbers
In 32-bit single-precision floating-point representation:
- The most significant bit is the sign bit (S), with 0 for positive numbers and 1 for negative numbers.
- The following 8 bits represent the exponent (E).
- The remaining 23 bits represent the fraction (F).
Normalized Form
Let's illustrate with an example. Suppose that the 32-bit pattern is 1 1000 0001 011 0000 0000 0000 0000 0000, with:
- S = 1
- E = 1000 0001
- F = 011 0000 0000 0000 0000 0000
In the normalized form, the actual fraction is normalized with an implicit leading 1 in the form of 1.F. In this example, the actual fraction is 1.011 0000 0000 0000 0000 0000 = 1 + 1×2^-2 + 1×2^-3 = 1.375D.
The sign bit represents the sign of the number, with S=0 for a positive number and S=1 for a negative number. In this example with S=1, this is a negative number, i.e., -1.375D.
In normalized form, the actual exponent is E-127 (so-called excess-127 or bias-127). This is because we need to represent both positive and negative exponents. With an 8-bit E, ranging from 0 to 255, the excess-127 scheme could provide actual exponents of -127 to 128. In this example, E-127=129-127=2D.
Hence, the number represented is -1.375×2^2=-5.5D.
De-Normalized Form
The normalized form has a serious problem: with an implicit leading 1 for the fraction, it cannot represent the number zero! Convince yourself on this!
The de-normalized form was devised to represent zero and other small numbers.
For E=0, the numbers are in the de-normalized form. An implicit leading 0 (instead of 1) is used for the fraction; and the actual exponent is always -126. Hence, the number zero can be represented with E=0 and F=0 (because 0.0×2^-126=0).
We can also represent very small positive and negative numbers in the de-normalized form with E=0. For example, if S=1, E=0, and F=011 0000 0000 0000 0000 0000, the actual fraction is 0.011B = 1×2^-2 + 1×2^-3 = 0.375D. Since S=1, it is a negative number. With E=0, the actual exponent is -126. Hence the number is -0.375×2^-126 ≈ -4.4×10^-39, which is an extremely small negative number (close to zero).
Summary
In summary, the value (N) is calculated as follows:
- For 1 ≤ E ≤ 254, N = (-1)^S × 1.F × 2^(E-127). These numbers are in the so-called normalized form. The sign bit represents the sign of the number. The fractional part (1.F) is normalized with an implicit leading 1. The exponent is biased (or in excess) by 127, so as to represent both positive and negative exponents. The range of the actual exponent is -126 to +127.
- For E = 0, N = (-1)^S × 0.F × 2^(-126). These numbers are in the so-called denormalized form. The exponent of 2^-126 evaluates to a very small number. The denormalized form is needed to represent zero (with F=0 and E=0). It can also represent very small positive and negative numbers close to zero.
- For E = 255, it represents special values, such as ±INF (positive and negative infinity) and NaN (not a number). This is beyond the scope of this article.
Example 1: Suppose that the IEEE-754 32-bit floating-point representation pattern is 0 10000000 110 0000 0000 0000 0000 0000.
Sign bit S = 0 ⇒ positive number
E = 1000 0000B = 128D (in normalized form)
Fraction is 1.11B (with an implicit leading 1) = 1 + 1×2^-1 + 1×2^-2 = 1.75D
The number is +1.75 × 2^(128-127) = +3.5D
Example 2: Suppose that the IEEE-754 32-bit floating-point representation pattern is 1 01111110 100 0000 0000 0000 0000 0000.
Sign bit S = 1 ⇒ negative number
E = 0111 1110B = 126D (in normalized form)
Fraction is 1.1B (with an implicit leading 1) = 1 + 2^-1 = 1.5D
The number is -1.5 × 2^(126-127) = -0.75D
Example 3: Suppose that the IEEE-754 32-bit floating-point representation pattern is 1 01111110 000 0000 0000 0000 0000 0001.
Sign bit S = 1 ⇒ negative number
E = 0111 1110B = 126D (in normalized form)
Fraction is 1.000 0000 0000 0000 0000 0001B (with an implicit leading 1) = 1 + 2^-23
The number is -(1 + 2^-23) × 2^(126-127) = -0.500000059604644775390625 (may not be exact in decimal!)
Example 4 (De-Normalized Form): Suppose that the IEEE-754 32-bit floating-point representation pattern is 1 00000000 000 0000 0000 0000 0000 0001.
Sign bit S = 1 ⇒ negative number
E = 0 (in de-normalized form)
Fraction is 0.000 0000 0000 0000 0000 0001B (with an implicit leading 0) = 1×2^-23
The number is -2^-23 × 2^(-126) = -2^(-149) ≈ -1.4×10^-45
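The worked examples can be checked against the JDK's own decoder, `Float.intBitsToFloat` (also mentioned in the "Notes for Java Users" section); binary literals with underscores mark off the S, E and F fields:

```java
public class FloatBits {
    public static void main(String[] args) {
        // Examples 1, 2 and 4 above; underscores separate sign, exponent, fraction.
        System.out.println(Float.intBitsToFloat(0b0_10000000_11000000000000000000000)); // 3.5
        System.out.println(Float.intBitsToFloat(0b1_01111110_10000000000000000000000)); // -0.75
        // The denormalized Example 4 is the smallest-magnitude negative float:
        System.out.println(Float.intBitsToFloat(0b1_00000000_00000000000000000000001)); // -1.4E-45
    }
}
```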
Exercises (Floating-Point Numbers)
- Compute the largest and smallest positive numbers that can be represented in the 32-bit normalized form.
- Compute the largest and smallest negative numbers that can be represented in the 32-bit normalized form.
- Repeat (1) for the 32-bit denormalized form.
- Repeat (2) for the 32-bit denormalized form.
Hints:
- Largest positive number: S=0, E=1111 1110 (254), F=111 1111 1111 1111 1111 1111.
  Smallest positive number: S=0, E=0000 0001 (1), F=000 0000 0000 0000 0000 0000.
- Same as above, but S=1.
- Largest positive number: S=0, E=0, F=111 1111 1111 1111 1111 1111.
  Smallest positive number: S=0, E=0, F=000 0000 0000 0000 0000 0001.
- Same as above, but S=1.
Notes for Java Users
You can use the JDK methods Float.intBitsToFloat(int bits) or Double.longBitsToDouble(long bits) to create a single-precision 32-bit float or a double-precision 64-bit double with the specific bit patterns, and print their values. For example,
System.out.println(Float.intBitsToFloat(0x7fffff));
System.out.println(Double.longBitsToDouble(0x1fffffffffffffL));
IEEE-754 64-bit Double-Precision Floating-Point Numbers
The representation scheme for 64-bit double-precision is similar to the 32-bit single-precision:
- The most significant bit is the sign bit (S), with 0 for positive numbers and 1 for negative numbers.
- The following 11 bits represent the exponent (E).
- The remaining 52 bits represent the fraction (F).
The value (N) is calculated as follows:
- Normalized form: For 1 ≤ E ≤ 2046, N = (-1)^S × 1.F × 2^(E-1023).
- Denormalized form: For E = 0, N = (-1)^S × 0.F × 2^(-1022).
- For E = 2047, N represents special values, such as ±INF (infinity) and NaN (not a number).
More on Floating-Point Representation
There are three parts in the floating-point representation:
- The sign bit (S) is self-explanatory (0 for positive numbers and 1 for negative numbers).
- For the exponent (E), a so-called bias (or excess) is applied so as to represent both positive and negative exponents. The bias is set at half of the range. For single precision with an 8-bit exponent, the bias is 127 (or excess-127). For double precision with an 11-bit exponent, the bias is 1023 (or excess-1023).
- The fraction (F) (also called the mantissa or significand) is composed of an implicit leading bit (before the radix point) and the fractional bits (after the radix point). The leading bit for normalized numbers is 1; while the leading bit for denormalized numbers is 0.
Normalized Floating-Signal Numbers
In normalized form, the radix point is placed after the first non-zero digit, e.g., 9.8765D×10^-23D, 1.001011B×2^11B. For binary numbers, the leading bit is always 1, and need not be represented explicitly; this saves 1 bit of storage.

In IEEE 754's normalized form:
- For single-precision, 1 ≤ E ≤ 254 with excess of 127. Hence, the actual exponent ranges from -126 to +127. Negative exponents are used to represent small numbers (< 1.0); positive exponents are used to represent large numbers (> 1.0). N = (-1)^S × 1.F × 2^(E-127)
- For double-precision, 1 ≤ E ≤ 2046 with excess of 1023. The actual exponent ranges from -1022 to +1023, and N = (-1)^S × 1.F × 2^(E-1023)
Take note that an n-bit pattern has a finite number of combinations (= 2^n), and thus can represent only finitely many distinct numbers. It is not possible to represent the infinitely many numbers on the real axis (even a small range such as 0.0 to 1.0 contains infinitely many numbers). That is, not all floating-point numbers can be represented exactly. Instead, the closest approximation is used, which leads to loss of accuracy.
The minimum and maximum normalized floating-point numbers are:
| Precision | Normalized N(min) | Normalized N(max) |
|---|---|---|
| Single | 0080 0000H 0 00000001 00000000000000000000000B E = 1, F = 0 N(min) = 1.0B × 2^-126 (≈1.17549435 × 10^-38) | 7F7F FFFFH 0 11111110 11111111111111111111111B E = 254, F = all 1's N(max) = 1.1...1B × 2^127 = (2 - 2^-23) × 2^127 (≈3.4028235 × 10^38) |
| Double | 0010 0000 0000 0000H N(min) = 1.0B × 2^-1022 (≈2.2250738585072014 × 10^-308) | 7FEF FFFF FFFF FFFFH N(max) = 1.1...1B × 2^1023 = (2 - 2^-52) × 2^1023 (≈1.7976931348623157 × 10^308) |
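The single-precision limits in the table can be checked against the JDK constants Float.MIN_NORMAL and Float.MAX_VALUE (a quick sanity check, not part of the original tutorial):

```java
public class NormalizedLimits {
    public static void main(String[] args) {
        // N(min) = 1.0B x 2^-126 has bit pattern 0080 0000H
        System.out.println(Integer.toHexString(Float.floatToIntBits(Float.MIN_NORMAL))); // 800000
        // N(max) = (2 - 2^-23) x 2^127 has bit pattern 7F7F FFFFH
        System.out.println(Integer.toHexString(Float.floatToIntBits(Float.MAX_VALUE)));  // 7f7fffff
        System.out.println(Float.MIN_NORMAL);  // 1.17549435E-38
        System.out.println(Float.MAX_VALUE);   // 3.4028235E38
    }
}
```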
Denormalized Floating-Bespeak Numbers
If E = 0, but the fraction is non-zero, then the value is in denormalized form, and a leading bit of 0 is assumed, as follows:
- For single-precision, E = 0, N = (-1)^S × 0.F × 2^(-126)
- For double-precision, E = 0, N = (-1)^S × 0.F × 2^(-1022)

Denormalized form can represent very small numbers close to zero, and zero itself, which cannot be represented in normalized form.
The minimum and maximum of denormalized floating-indicate numbers are:
| Precision | Denormalized D(min) | Denormalized D(max) |
|---|---|---|
| Single | 0000 0001H 0 00000000 00000000000000000000001B E = 0, F = 00000000000000000000001B D(min) = 0.0...1B × 2^-126 = 2^-23 × 2^-126 = 2^-149 (≈1.4 × 10^-45) | 007F FFFFH 0 00000000 11111111111111111111111B E = 0, F = 11111111111111111111111B D(max) = 0.1...1B × 2^-126 = (1-2^-23)×2^-126 (≈1.1754942 × 10^-38) |
| Double | 0000 0000 0000 0001H D(min) = 0.0...1B × 2^-1022 = 2^-52 × 2^-1022 = 2^-1074 (≈4.9 × 10^-324) | 000F FFFF FFFF FFFFH D(max) = 0.1...1B × 2^-1022 = (1-2^-52)×2^-1022 (≈2.2250738585072009 × 10^-308) |
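The denormalized minimums are again exposed directly by the JDK as Float.MIN_VALUE and Double.MIN_VALUE (a quick check against the table above):

```java
public class DenormalizedLimits {
    public static void main(String[] args) {
        // D(min) = 2^-149 has bit pattern 0000 0001H (smallest positive float)
        System.out.println(Integer.toHexString(Float.floatToIntBits(Float.MIN_VALUE))); // 1
        System.out.println(Float.MIN_VALUE);   // 1.4E-45  = 2^-149
        System.out.println(Double.MIN_VALUE);  // 4.9E-324 = 2^-1074
    }
}
```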
Special Values
Zero: Zero cannot be represented in normalized form, and must be represented in denormalized form with E=0 and F=0. There are two representations for zero: +0 with S=0 and -0 with S=1.

Infinity: The values +infinity (e.g., 1/0) and -infinity (e.g., -1/0) are represented with an exponent of all 1's (E = 255 for single-precision and E = 2047 for double-precision), F=0, and S=0 (for +INF) or S=1 (for -INF).

Not a Number (NaN): NaN denotes a value that cannot be represented as a real number (e.g., 0/0). NaN is represented with an exponent of all 1's (E = 255 for single-precision and E = 2047 for double-precision) and any non-zero fraction.
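These special bit patterns can be produced with the JDK conversion method mentioned earlier (a minimal sketch):

```java
public class FloatSpecials {
    public static void main(String[] args) {
        System.out.println(Float.intBitsToFloat(0x00000000)); // 0.0   (S=0, E=0, F=0)
        System.out.println(Float.intBitsToFloat(0x80000000)); // -0.0  (S=1, E=0, F=0)
        System.out.println(Float.intBitsToFloat(0x7F800000)); // Infinity  (E=255, F=0)
        System.out.println(Float.intBitsToFloat(0xFF800000)); // -Infinity (S=1)
        System.out.println(Float.intBitsToFloat(0x7FC00000)); // NaN (E=255, F non-zero)
        System.out.println(1.0f / 0.0f);                      // Infinity
        System.out.println(0.0f / 0.0f);                      // NaN
    }
}
```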
Character Encoding
In computer memory, characters are "encoded" (or "represented") using a chosen "character encoding scheme" (aka "character set", "charset", "character map", or "code page").

For example, in ASCII (as well as Latin-1, Unicode, and many other character sets):
- code numbers 65D (41H) to 90D (5AH) represent 'A' to 'Z', respectively.
- code numbers 97D (61H) to 122D (7AH) represent 'a' to 'z', respectively.
- code numbers 48D (30H) to 57D (39H) represent '0' to '9', respectively.

It is important to note that the representation scheme must be known before a binary pattern can be interpreted. E.g., the 8-bit pattern "0100 0010B" could represent anything under the sun, known only to the person who encoded it.

The most commonly-used character encoding schemes are: 7-bit ASCII (ISO/IEC 646) and 8-bit Latin-x (ISO/IEC 8859-x) for western european characters, and Unicode (ISO/IEC 10646) for internationalization (i18n).

A 7-bit encoding scheme (such as ASCII) can represent 128 characters and symbols. An 8-bit character encoding scheme (such as Latin-x) can represent 256 characters and symbols; a 16-bit encoding scheme (such as Unicode UCS-2) can represent 65,536 characters and symbols.
7-bit ASCII Code (aka US-ASCII, ISO/IEC 646, ITU-T T.50)
- ASCII (American Standard Code for Information Interchange) is one of the earlier character coding schemes.
- ASCII is originally a 7-bit code. It has been extended to 8-bit to better utilize the 8-bit computer memory organization. (The 8th bit was originally used for parity checking in early computers.)
- Code numbers 32D (20H) to 126D (7EH) are printable (displayable) characters, as tabulated (arranged in hexadecimal and decimal) below:

| Hex | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | A | B | C | D | E | F |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | SP | ! | " | # | $ | % | & | ' | ( | ) | * | + | , | - | . | / |
| 3 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | : | ; | < | = | > | ? |
| 4 | @ | A | B | C | D | E | F | G | H | I | J | K | L | M | N | O |
| 5 | P | Q | R | S | T | U | V | W | X | Y | Z | [ | \ | ] | ^ | _ |
| 6 | ` | a | b | c | d | e | f | g | h | i | j | k | l | m | n | o |
| 7 | p | q | r | s | t | u | v | w | x | y | z | { | \| | } | ~ | |

| Dec | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 3 | | | SP | ! | " | # | $ | % | & | ' |
| 4 | ( | ) | * | + | , | - | . | / | 0 | 1 |
| 5 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | : | ; |
| 6 | < | = | > | ? | @ | A | B | C | D | E |
| 7 | F | G | H | I | J | K | L | M | N | O |
| 8 | P | Q | R | S | T | U | V | W | X | Y |
| 9 | Z | [ | \ | ] | ^ | _ | ` | a | b | c |
| 10 | d | e | f | g | h | i | j | k | l | m |
| 11 | n | o | p | q | r | s | t | u | v | w |
| 12 | x | y | z | { | \| | } | ~ | | | |

- Code number 32D (20H) is the blank or space character.
- '0' to '9': 30H-39H (0011 0000B to 0011 1001B), or (0011 xxxxB, where xxxx is the equivalent integer value).
- 'A' to 'Z': 41H-5AH (0100 0001B to 0101 1010B), or (010x xxxxB). 'A' to 'Z' are continuous without gap.
- 'a' to 'z': 61H-7AH (0110 0001B to 0111 1010B), or (011x xxxxB). 'a' to 'z' are also continuous without gap. However, there is a gap between uppercase and lowercase letters. To convert between upper- and lowercase, flip the value of bit-5.
- Code numbers 0D (00H) to 31D (1FH), and 127D (7FH), are special control characters, which are non-printable (non-displayable), as tabulated below. Many of these characters were used in the early days for transmission control (e.g., STX, ETX) and printer control (e.g., Form-Feed), which are now obsolete. The remaining meaningful codes today are:
- 09H for Tab ('\t').
- 0AH for Line-Feed or newline (LF or '\n') and 0DH for Carriage-Return (CR or '\r'), which are used as the line delimiter (aka line separator, end-of-line) for text files. There is unfortunately no standard for the line delimiter: Unixes and Mac use 0AH (LF or "\n"); Windows uses 0D0AH (CR+LF or "\r\n"). Programming languages such as C/C++/Java (which were created on Unix) use 0AH (LF or "\n").
- In programming languages such as C/C++/Java, line-feed (0AH) is denoted as '\n', carriage-return (0DH) as '\r', and tab (09H) as '\t'.
| Dec | Hex | Code | Meaning | Dec | Hex | Code | Meaning |
|---|---|---|---|---|---|---|---|
| 0 | 00 | NUL | Null | 17 | 11 | DC1 | Device Control 1 |
| 1 | 01 | SOH | Start of Heading | 18 | 12 | DC2 | Device Control 2 |
| 2 | 02 | STX | Start of Text | 19 | 13 | DC3 | Device Control 3 |
| 3 | 03 | ETX | End of Text | 20 | 14 | DC4 | Device Control 4 |
| 4 | 04 | EOT | End of Transmission | 21 | 15 | NAK | Negative Ack. |
| 5 | 05 | ENQ | Enquiry | 22 | 16 | SYN | Sync. Idle |
| 6 | 06 | ACK | Acknowledgment | 23 | 17 | ETB | End of Transmission Block |
| 7 | 07 | BEL | Bell | 24 | 18 | CAN | Cancel |
| 8 | 08 | BS | Back Space '\b' | 25 | 19 | EM | End of Medium |
| 9 | 09 | HT | Horizontal Tab '\t' | 26 | 1A | SUB | Substitute |
| 10 | 0A | LF | Line Feed '\n' | 27 | 1B | ESC | Escape |
| 11 | 0B | VT | Vertical Feed | 28 | 1C | IS4 | File Separator |
| 12 | 0C | FF | Form Feed '\f' | 29 | 1D | IS3 | Group Separator |
| 13 | 0D | CR | Carriage Return '\r' | 30 | 1E | IS2 | Record Separator |
| 14 | 0E | SO | Shift Out | 31 | 1F | IS1 | Unit Separator |
| 15 | 0F | SI | Shift In | | | | |
| 16 | 10 | DLE | Datalink Escape | 127 | 7F | DEL | Delete |
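Because of the ASCII layout described above, uppercase and lowercase letters differ only in bit-5 (mask 20H), so case can be converted with a single XOR (a minimal sketch):

```java
public class CaseFlip {
    public static void main(String[] args) {
        char upper = 'A';                        // 41H = 0100 0001B
        char lower = (char) (upper ^ 0x20);      // flip bit-5 -> 61H = 0110 0001B
        System.out.println(lower);               // a
        System.out.println((char) ('z' ^ 0x20)); // Z
    }
}
```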
8-bit Latin-1 (aka ISO/IEC 8859-1)
ISO/IEC 8859 is a collection of 8-bit character encoding standards for the western languages.
ISO/IEC 8859-1, aka Latin alphabet No. 1, or Latin-1 in short, is the most commonly-used encoding scheme for western european languages. It has 191 printable characters from the latin script, which covers languages like English, German, Italian, Portuguese and Spanish. Latin-1 is backward compatible with the 7-bit US-ASCII code. That is, the first 128 characters in Latin-1 (code numbers 0 to 127 (7FH)) are the same as US-ASCII. Code numbers 128 (80H) to 159 (9FH) are not assigned. Code numbers 160 (A0H) to 255 (FFH) are assigned as follows:
| Hex | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | A | B | C | D | E | F |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A | NBSP | ¡ | ¢ | £ | ¤ | ¥ | ¦ | § | ¨ | © | ª | « | ¬ | SHY | ® | ¯ |
| B | ° | ± | ² | ³ | ´ | µ | ¶ | · | ¸ | ¹ | º | » | ¼ | ½ | ¾ | ¿ |
| C | À | Á | Â | Ã | Ä | Å | Æ | Ç | È | É | Ê | Ë | Ì | Í | Î | Ï |
| D | Ð | Ñ | Ò | Ó | Ô | Õ | Ö | × | Ø | Ù | Ú | Û | Ü | Ý | Þ | ß |
| E | à | á | â | ã | ä | å | æ | ç | è | é | ê | ë | ì | í | î | ï |
| F | ð | ñ | ò | ó | ô | õ | ö | ÷ | ø | ù | ú | û | ü | ý | þ | ÿ |
ISO/IEC 8859 has 16 parts. Besides the most commonly-used Part 1, Part 2 is meant for Central European (Polish, Czech, Hungarian, etc.), Part 3 for South European (Turkish, etc.), Part 4 for North European (Estonian, Latvian, etc.), Part 5 for Cyrillic, Part 6 for Arabic, Part 7 for Greek, Part 8 for Hebrew, Part 9 for Turkish, Part 10 for Nordic, Part 11 for Thai, Part 12 was abandoned, Part 13 for Baltic Rim, Part 14 for Celtic, Part 15 for French, Finnish, etc., and Part 16 for South-Eastern European.
Other 8-bit Extensions of US-ASCII (ASCII Extensions)
Besides the standardized ISO-8859-x, there are many 8-bit ASCII extensions, which are not compatible with each other.
ANSI (American National Standards Institute) (aka Windows-1252, or Windows Codepage 1252): for Latin alphabets used in the legacy DOS/Windows systems. It is a superset of ISO-8859-1, with code numbers 128 (80H) to 159 (9FH) assigned to displayable characters, such as "smart" single-quotes and double-quotes. A common problem in web browsers is that all the quotes and apostrophes (produced by "smart quotes" in some Microsoft software) are replaced with question marks or strange symbols. This is because the document is labeled as ISO-8859-1 (instead of Windows-1252), where these code numbers are undefined. Most modern browsers and e-mail clients treat the charset ISO-8859-1 as Windows-1252 in order to accommodate such mislabeling.
| Hex | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | A | B | C | D | E | F |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8 | € | | ‚ | ƒ | „ | … | † | ‡ | ˆ | ‰ | Š | ‹ | Œ | | Ž | |
| 9 | | ' | ' | " | " | • | – | — | ˜ | ™ | š | › | œ | | ž | Ÿ |
EBCDIC (Extended Binary Coded Decimal Interchange Code): Used in the early IBM computers.
Unicode (aka ISO/IEC 10646 Universal Character Set)
Before Unicode, no single character encoding scheme could represent characters in all languages. For example, western europe uses several encoding schemes (in the ISO-8859-x family). Even a single language like Chinese has a few encoding schemes (GB2312/GBK, BIG5). Many encoding schemes conflict with each other, i.e., the same code number is assigned to different characters.
Unicode aims to provide a standard character encoding scheme, which is universal, efficient, uniform and unambiguous. The Unicode standard is maintained by a non-profit organization called the Unicode Consortium (@ www.unicode.org). Unicode is an ISO/IEC standard 10646.
Unicode is backward compatible with 7-bit US-ASCII and 8-bit Latin-1 (ISO-8859-1). That is, the first 128 characters are the same as US-ASCII, and the first 256 characters are the same as Latin-1.
Unicode originally used 16 bits (called UCS-2 or Unicode Character Set - 2 byte), which can represent up to 65,536 characters. It has since been expanded beyond 16 bits, and currently stands at 21 bits. The range of legal codes in ISO/IEC 10646 is now from U+0000H to U+10FFFFH (21 bits or about 2 million characters), covering all current and ancient historical scripts. The original 16-bit range of U+0000H to U+FFFFH (65536 characters) is known as the Basic Multilingual Plane (BMP), covering all the major languages in current use. The characters outside the BMP are called Supplementary Characters, which are not often used.
Unicode has two encoding schemes:
- UCS-2 (Universal Character Set - 2 Byte): Uses 2 bytes (16 bits), covering the 65,536 characters in the BMP. The BMP is sufficient for most applications. UCS-2 is now obsolete.
- UCS-4 (Universal Character Set - 4 Byte): Uses 4 bytes (32 bits), covering the BMP and the supplementary characters.
UTF-8 (Unicode Transformation Format - 8-bit)
The 16/32-bit Unicode (UCS-2/4) is grossly inefficient if the document contains mainly ASCII characters, because each character occupies at least two bytes of storage. Variable-length encoding schemes, such as UTF-8, which uses 1-4 bytes to represent a character, were devised to improve efficiency. In UTF-8, the 128 commonly-used US-ASCII characters use only one byte, but some less-common characters may require up to four bytes. Overall, the efficiency is improved for documents containing mainly US-ASCII text.
The transformation between Unicode and UTF-8 is as follows:
| Bits | Unicode | UTF-8 Code | Bytes |
|---|---|---|---|
| 7 | 00000000 0xxxxxxx | 0xxxxxxx | 1 (ASCII) |
| 11 | 00000yyy yyxxxxxx | 110yyyyy 10xxxxxx | 2 |
| 16 | zzzzyyyy yyxxxxxx | 1110zzzz 10yyyyyy 10xxxxxx | 3 |
| 21 | 000uuuuu zzzzyyyy yyxxxxxx | 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx | 4 |
In UTF-8, Unicode numbers corresponding to the 7-bit ASCII characters are padded with a leading zero, and thus have the same value as in ASCII. Hence, UTF-8 can be used with all software expecting ASCII. Unicode numbers of 128 and above, which are less frequently used, are encoded using more bytes (2-4 bytes). UTF-8 generally requires less storage and is compatible with ASCII. The drawback of UTF-8 is that more processing power is needed to unpack the code, due to its variable length. UTF-8 is the most popular format for Unicode.
Notes:
- UTF-8 uses 1-3 bytes for the characters in the BMP (16-bit), and 4 bytes for supplementary characters outside the BMP (21-bit).
- The 128 ASCII characters (basic Latin letters, digits, and punctuation signs) use one byte. Most European and Middle Eastern characters use a 2-byte sequence, which includes extended Latin letters (with tilde, macron, acute, grave and other accents), Greek, Armenian, Hebrew, Arabic, and others. Chinese, Japanese and Korean (CJK) characters use 3-byte sequences.
- All the bytes, except those of the 128 ASCII characters, have a leading '1' bit. In other words, the ASCII bytes, with a leading '0' bit, can be identified and decoded easily.
Example: 您好 (Unicode: 60A8H 597DH)
Unicode (UCS-2) is 60A8H = 0110 000010 101000B ⇒ UTF-8 is 11100110 10000010 10101000B = E6 82 A8H
Unicode (UCS-2) is 597DH = 0101 100101 111101B ⇒ UTF-8 is 11100101 10100101 10111101B = E5 A5 BDH
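The 3-byte transformation above can be checked in Java; the manual bit-packing mirrors the 1110zzzz 10yyyyyy 10xxxxxx row of the table, and the library encoder serves as a cross-check (a small sketch):

```java
import java.nio.charset.StandardCharsets;

public class Utf8Demo {
    public static void main(String[] args) {
        char c = '\u60A8';  // 您
        // Manual 3-byte encoding: 1110zzzz 10yyyyyy 10xxxxxx
        int b1 = 0xE0 | (c >> 12);          // top 4 bits (zzzz)
        int b2 = 0x80 | ((c >> 6) & 0x3F);  // middle 6 bits (yyyyyy)
        int b3 = 0x80 | (c & 0x3F);         // low 6 bits (xxxxxx)
        System.out.printf("%02X %02X %02X%n", b1, b2, b3);  // E6 82 A8
        // Cross-check with the JDK's UTF-8 encoder
        for (byte b : "\u60A8".getBytes(StandardCharsets.UTF_8)) {
            System.out.printf("%02X ", b);  // E6 82 A8
        }
    }
}
```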
UTF-16 (Unicode Transformation Format - 16-bit)
UTF-16 is a variable-length Unicode character encoding scheme, which uses 2 or 4 bytes. UTF-16 is not commonly used. The transformation table is as follows:

| Unicode | UTF-16 Code | Bytes |
|---|---|---|
| xxxxxxxx xxxxxxxx | Same as UCS-2 - no encoding | 2 |
| 000uuuuu zzzzyyyy yyxxxxxx (uuuuu≠0) | 110110ww wwzzzzyy 110111yy yyxxxxxx (wwww = uuuuu - 1) | 4 |

Take note that for the 65536 characters in the BMP, UTF-16 is the same as UCS-2 (2 bytes). However, 4 bytes are used for the supplementary characters outside the BMP.

For BMP characters, UTF-16 is the same as UCS-2. For supplementary characters, each character requires a pair of 16-bit values: the first from the high-surrogates range (\uD800-\uDBFF), the second from the low-surrogates range (\uDC00-\uDFFF).
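The surrogate-pair arithmetic can be sketched as follows, using the supplementary code point U+1F600 as an assumed example (the JDK's Character methods provide a cross-check):

```java
public class SurrogateDemo {
    public static void main(String[] args) {
        int codePoint = 0x1F600;       // supplementary character (outside BMP)
        int v = codePoint - 0x10000;   // 20-bit value (this subtraction is the wwww = uuuuu - 1 step)
        int high = 0xD800 | (v >> 10); // top 10 bits -> high surrogate
        int low  = 0xDC00 | (v & 0x3FF); // bottom 10 bits -> low surrogate
        System.out.printf("%04X %04X%n", high, low);  // D83D DE00
        // Cross-check with the JDK
        System.out.printf("%04X %04X%n",
            (int) Character.highSurrogate(codePoint),
            (int) Character.lowSurrogate(codePoint));  // D83D DE00
    }
}
```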
UTF-32 (Unicode Transformation Format - 32-bit)
Same as UCS-4: uses 4 bytes for each character, unencoded.
Formats of Multi-Byte (e.g., Unicode) Text Files
Endianness (or byte-order): For a multi-byte character, you need to take care of the order of the bytes in storage. In big endian, the most significant byte is stored at the memory location with the lowest address (big byte first). In little endian, the most significant byte is stored at the memory location with the highest address (little byte first). For example, 您 (with Unicode number 60A8H) is stored as 60 A8 in big endian, and as A8 60 in little endian. Big endian, which produces a more readable hex dump, is more commonly used, and is often the default.
BOM (Byte Order Mark): BOM is a special Unicode character with code number FEFFH, which is used to differentiate big-endian and little-endian. For big-endian, the BOM appears as FE FFH in the storage. For little-endian, the BOM appears as FF FEH. Unicode reserves these two code numbers to prevent them from clashing with other characters.
Unicode text files can take these formats:
- Big Endian: UCS-2BE, UTF-16BE, UTF-32BE.
- Little Endian: UCS-2LE, UTF-16LE, UTF-32LE.
- UTF-16 with BOM. The first character of the file is a BOM character, which specifies the endianness. For big-endian, the BOM appears as FE FFH in the storage. For little-endian, the BOM appears as FF FEH.

A UTF-8 file is always stored as big endian; BOM plays no role. However, on some systems (in particular Windows), a BOM is added as the first character of a UTF-8 file as a signature to identify the file as UTF-8 encoded. The BOM character (FEFFH) is encoded in UTF-8 as EF BB BF. Adding a BOM as the first character of the file is not recommended, as it may be incorrectly interpreted on other systems. You can have a UTF-8 file without a BOM.
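The byte orders and the BOM can be observed directly with the JDK charsets (a small sketch; Java's "UTF-16" charset happens to encode as big-endian with a BOM):

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;

public class BomDemo {
    public static void main(String[] args) {
        String s = "\u60A8";  // 您 (60A8H)
        for (String cs : new String[] {"UTF-16BE", "UTF-16LE", "UTF-16"}) {
            ByteBuffer bb = Charset.forName(cs).encode(s);
            System.out.printf("%-8s: ", cs);
            while (bb.hasRemaining()) System.out.printf("%02X ", bb.get());
            System.out.println();
        }
        // UTF-16BE: 60 A8           (big endian)
        // UTF-16LE: A8 60           (little endian)
        // UTF-16  : FE FF 60 A8    (big endian with BOM)
    }
}
```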
Formats of Text Files
Line Delimiter or End-Of-Line (EOL): Sometimes, when you use Windows NotePad to open a text file (created on Unix or Mac), all the lines are joined together. This is because different operating platforms use different characters as the so-called line delimiter (or end-of-line or EOL). Two non-printable control characters are involved: 0AH (Line-Feed or LF) and 0DH (Carriage-Return or CR).
- Windows/DOS uses 0D0AH (CR+LF or "\r\n") as EOL.
- Unix and Mac use 0AH (LF or "\n") only.
End-of-File (EOF): [TODO]
Windows' CMD Codepage
The character encoding scheme (charset) in Windows is called a codepage. In the CMD shell, you can issue the command "chcp" to display the current codepage, or "chcp codepage-number" to change the codepage.
Take note that:
- The default codepage 437 (used in the original DOS) is an 8-bit character set called Extended ASCII, which differs from Latin-1 for code numbers above 127.
- Codepage 1252 (Windows-1252) is not exactly the same as Latin-1. It assigns code numbers 80H to 9FH to letters and punctuation, such as smart single-quotes and double-quotes. A common problem in browsers that display quotes and apostrophes as question marks or boxes is that the page is actually Windows-1252 but mislabeled as ISO-8859-1.
- For internationalization and chinese character sets: codepage 65001 for UTF-8, codepage 1201 for UCS-2BE, codepage 1200 for UCS-2LE, codepage 936 for chinese characters in GB2312, codepage 950 for chinese characters in Big5.
Chinese Character Sets
Unicode supports all languages, including asian languages like Chinese (both simplified and traditional characters), Japanese and Korean (collectively called CJK). There are more than 20,000 CJK characters in Unicode. Unicode characters are often encoded in the UTF-8 scheme, which, unfortunately, requires 3 bytes for each CJK character, instead of 2 bytes in the unencoded UCS-2 (UTF-16).
Worse still, there are also various chinese character sets, which are not compatible with Unicode:
- GB2312/GBK: for simplified chinese characters. GB2312 uses 2 bytes for each chinese character. The most significant bit (MSB) of both bytes is set to 1, to co-exist with 7-bit ASCII (whose MSB is 0). There are about 6700 characters. GBK is an extension of GB2312, which includes more characters as well as traditional chinese characters.
- BIG5: for traditional chinese characters. BIG5 also uses 2 bytes for each chinese character, with the most significant bit of both bytes set to 1. BIG5 is not compatible with GBK, i.e., the same code number is assigned to different characters.
For example, the world is made more interesting with these many standards:
| | Standard | Characters | Codes |
|---|---|---|---|
| Simplified | GB2312 | 和谐 | BACD D0B3 |
| | UCS-2 | 和谐 | 548C 8C10 |
| | UTF-8 | 和谐 | E5928C E8B090 |
| Traditional | BIG5 | 和諧 | A94D BFD3 |
| | UCS-2 | 和諧 | 548C 8AE7 |
| | UTF-8 | 和諧 | E5928C E8ABA7 |
Notes for Windows' CMD Users: To display chinese characters correctly in the CMD shell, you need to choose the correct codepage, e.g., 65001 for UTF-8, 936 for GB2312/GBK, 950 for Big5, 1201 for UCS-2BE, 1200 for UCS-2LE, 437 for the original DOS. You can use the command "chcp" to display the current codepage and "chcp codepage_number" to change the codepage. You also have to choose a font that can display the characters (e.g., Courier New, Consolas or Lucida Console, NOT a Raster font).
Collating Sequences (for Ranking Characters)
A string consists of a sequence of characters in upper or lower cases, e.g., "apple", "BOY", "Cat". In sorting or comparing strings, if we order the characters according to the underlying code numbers (e.g., US-ASCII) character-by-character, the order for the example would be "BOY", "Cat", "apple", because uppercase letters have smaller code numbers than lowercase letters. This does not agree with the so-called dictionary order, where the same uppercase and lowercase letters have the same rank. Another common problem in ordering strings is that "10" (ten) is at times ordered in front of "1" to "9".
Hence, in sorting or comparing strings, a so-called collating sequence (or collation) is often defined, which specifies the ranks for letters (uppercase, lowercase), numbers, and special symbols. There are many collating sequences available. It is entirely up to you to choose a collating sequence to meet your application's specific requirements. Some case-insensitive dictionary-order collating sequences have the same rank for the same uppercase and lowercase letters, i.e., 'A', 'a' ⇒ 'B', 'b' ⇒ ... ⇒ 'Z', 'z'. Some case-sensitive dictionary-order collating sequences put an uppercase letter before its lowercase counterpart, i.e., 'A' ⇒ 'B' ⇒ 'C' ⇒ ... ⇒ 'a' ⇒ 'b' ⇒ 'c' ⇒ .... Typically, space is ranked before digits '0' to '9', followed by the alphabets.
Collating sequences are often language dependent, as different languages use different sets of characters (e.g., á, é, a, α) with their own orders.
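In Java, a locale-sensitive collating sequence is provided by java.text.Collator; a minimal sketch contrasting it with plain code-number order:

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

public class CollatorDemo {
    public static void main(String[] args) {
        String[] byCode = {"apple", "BOY", "Cat"};
        Arrays.sort(byCode);  // code-number (US-ASCII) order
        System.out.println(Arrays.toString(byCode));  // [BOY, Cat, apple]

        String[] byDict = {"apple", "BOY", "Cat"};
        Arrays.sort(byDict, Collator.getInstance(Locale.ENGLISH));  // dictionary order
        System.out.println(Arrays.toString(byDict));  // [apple, BOY, Cat]
    }
}
```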
For Java Programmers - java.nio.charset
JDK 1.4 introduced the java.nio.charset package to support encoding/decoding of characters from the UCS-2 used internally in Java programs to any supported charset used by external devices.
Example: The following program encodes some Unicode text in various encoding schemes, and displays the hex codes of the encoded byte sequences.

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;

public class TestCharsetEncodeDecode {
   public static void main(String[] args) {
      String[] charsetNames = {"US-ASCII", "ISO-8859-1", "UTF-8", "UTF-16",
                               "UTF-16BE", "UTF-16LE", "GBK", "BIG5"};
      String message = "Hi,您好!";
      System.out.printf("%10s: ", "UCS-2");
      for (int i = 0; i < message.length(); i++) {
         System.out.printf("%04X ", (int) message.charAt(i));
      }
      System.out.println();
      for (String charsetName : charsetNames) {
         Charset charset = Charset.forName(charsetName);
         System.out.printf("%10s: ", charset.name());
         ByteBuffer bb = charset.encode(message);
         while (bb.hasRemaining()) {
            System.out.printf("%02X ", bb.get());
         }
         System.out.println();
         bb.rewind();
      }
   }
}

     UCS-2: 0048 0069 002C 60A8 597D 0021
  US-ASCII: 48 69 2C 3F 3F 21
ISO-8859-1: 48 69 2C 3F 3F 21
     UTF-8: 48 69 2C E6 82 A8 E5 A5 BD 21
    UTF-16: FE FF 00 48 00 69 00 2C 60 A8 59 7D 00 21
  UTF-16BE: 00 48 00 69 00 2C 60 A8 59 7D 00 21
  UTF-16LE: 48 00 69 00 2C 00 A8 60 7D 59 21 00
       GBK: 48 69 2C C4 FA BA C3 21
      BIG5: 48 69 2C B1 7A A6 6E 21
For Java Programmers - char and String
The char data type is based on the original 16-bit Unicode standard called UCS-2. Unicode has since evolved to 21 bits, with a code range of U+0000 to U+10FFFF. The set of characters from U+0000 to U+FFFF is known as the Basic Multilingual Plane (BMP). Characters above U+FFFF are called supplementary characters. A 16-bit Java char cannot hold a supplementary character.
Recall that in the UTF-16 encoding scheme, a BMP character uses 2 bytes; it is the same as UCS-2. A supplementary character uses 4 bytes, and requires a pair of 16-bit values: the first from the high-surrogates range (\uD800-\uDBFF), the second from the low-surrogates range (\uDC00-\uDFFF).
In Java, a String is a sequence of Unicode characters. Java, in fact, uses UTF-16 for String and StringBuffer. For BMP characters, they are the same as UCS-2. For supplementary characters, each character requires a pair of char values.
Java methods that accept a 16-bit char value do not support supplementary characters. Methods that accept a 32-bit int value support all Unicode characters (in the lower 21 bits), including supplementary characters.
This is meant to be an academic discussion. I have yet to see the use of supplementary characters!
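A brief sketch of the char-vs-int distinction, using the supplementary code point U+1F600 as an assumed example:

```java
public class SupplementaryDemo {
    public static void main(String[] args) {
        // A supplementary character becomes a surrogate pair in a String
        String s = new String(Character.toChars(0x1F600));
        System.out.println(s.length());                      // 2 (two char values)
        System.out.println(s.codePointCount(0, s.length())); // 1 (one character)
        System.out.printf("%X%n", s.codePointAt(0));         // 1F600 (32-bit int API)
    }
}
```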
Displaying Hex Values & Hex Editors
At times, you may need to display the hex values of a file, especially when dealing with Unicode characters. A Hex Editor is a handy tool that a good programmer should have in his/her toolbox. There are many freeware/shareware Hex Editors available. Try googling "Hex Editor".
I have used the following:
- NotePad++ with Hex Editor Plug-in: Open-source and free. You can toggle between Hex view and Normal view by pushing the "H" button.
- PSPad: Freeware. You can toggle to Hex view by choosing the "View" menu and selecting "Hex Edit Mode".
- TextPad: Shareware without expiration period. To view the Hex value, you need to "open" the file by choosing the file format "binary" (??).
- UltraEdit: Shareware, not free, 30-day trial only.
Let me know if you have a better choice, which is fast to launch, easy to use, can toggle between Hex and normal view, free, ....
The following Java program can be used to display the hex codes of Java primitives (integer, character and floating-point):

public class PrintHexCode {
   public static void main(String[] args) {
      int i = 12345;
      System.out.println("Decimal is " + i);
      System.out.println("Hex is " + Integer.toHexString(i));
      System.out.println("Binary is " + Integer.toBinaryString(i));
      System.out.println("Octal is " + Integer.toOctalString(i));
      System.out.printf("Hex is %x\n", i);
      System.out.printf("Octal is %o\n", i);

      char c = 'a';
      System.out.println("Character is " + c);
      System.out.printf("Character is %c\n", c);
      System.out.printf("Hex is %x\n", (short) c);
      System.out.printf("Decimal is %d\n", (short) c);

      float f = 3.5f;
      System.out.println("Decimal is " + f);
      System.out.println(Float.toHexString(f));

      f = -0.75f;
      System.out.println("Decimal is " + f);
      System.out.println(Float.toHexString(f));

      double d = 11.22;
      System.out.println("Decimal is " + d);
      System.out.println(Double.toHexString(d));
   }
}
In Eclipse, you can view the hex code of integer primitive Java variables in debug mode as follows: In the debug perspective, "Variable" panel ⇒ select the "menu" (inverted triangle) ⇒ Java ⇒ Java Preferences... ⇒ Primitive Display Options ⇒ check "Display hexadecimal values (byte, short, char, int, long)".
Summary - Why Bother about Data Representation?
The integer number 1, the floating-point number 1.0, the character symbol '1', and the string "1" are totally different inside the computer memory. You need to know the difference to write good and high-performance programs.
- In 8-bit signed integer, integer number 1 is represented as 00000001B.
- In 8-bit unsigned integer, integer number 1 is represented as 00000001B.
- In 16-bit signed integer, integer number 1 is represented as 00000000 00000001B.
- In 32-bit signed integer, integer number 1 is represented as 00000000 00000000 00000000 00000001B.
- In 32-bit floating-point representation, number 1.0 is represented as 0 01111111 0000000 00000000 00000000B, i.e., S=0, E=127, F=0.
- In 64-bit floating-point representation, number 1.0 is represented as 0 01111111111 0000 00000000 00000000 00000000 00000000 00000000 00000000B, i.e., S=0, E=1023, F=0.
- In 8-bit Latin-1, the character symbol '1' is represented as 00110001B (or 31H).
- In 16-bit UCS-2, the character symbol '1' is represented as 00000000 00110001B.
- In UTF-8, the character symbol '1' is represented as 00110001B.

If you "add" a 16-bit signed integer 1 and the Latin-1 character '1' or the string "1", you could get a surprise.
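A small sketch of that surprise in Java, where the same "+" does numeric addition on a char's code number but concatenation on a string:

```java
public class OnePlusOne {
    public static void main(String[] args) {
        int i = 1;
        char c = '1';          // code number 49 (31H)
        String s = "1";
        System.out.println(i + c);  // 50  (char widened to its code number)
        System.out.println(i + s);  // 11  (string concatenation)
        // And 1.0f is yet another bit pattern entirely:
        System.out.printf("%X%n", Float.floatToIntBits(1.0f));  // 3F800000
    }
}
```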
Exercises (Data Representation)
For the following 16-bit codes:
0000 0000 0010 1010; 1000 0000 0010 1010;
Give their values, if they are representing:
- a 16-bit unsigned integer;
- a 16-bit signed integer;
- two 8-bit unsigned integers;
- two 8-bit signed integers;
- a 16-bit Unicode character;
- two 8-bit ISO-8859-1 characters.
Ans: (1) 42, 32810; (2) 42, -32726; (3) 0, 42; 128, 42; (4) 0, 42; -128, 42; (5) '*'; '耪'; (6) NUL, '*'; PAD, '*'.
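The answers can be verified programmatically; a quick check of (1), (2) and (5) in Java:

```java
public class ExerciseCheck {
    public static void main(String[] args) {
        int code1 = 0x002A, code2 = 0x802A;  // the two 16-bit patterns
        System.out.println(code1 + " " + code2);                  // 42 32810 (unsigned)
        System.out.println((short) code1 + " " + (short) code2);  // 42 -32726 (signed)
        System.out.println((char) code1 + " " + (char) code2);    // * 耪 (UCS-2)
    }
}
```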
REFERENCES & RESOURCES
- (Floating-Point Number Specification) IEEE 754 (1985), "IEEE Standard for Binary Floating-Point Arithmetic".
- (ASCII Specification) ISO/IEC 646 (1991) (or ITU-T T.50-1992), "Information technology - 7-bit coded character set for information interchange".
- (Latin-1 Specification) ISO/IEC 8859-1, "Information technology - 8-bit single-byte coded graphic character sets - Part 1: Latin alphabet No. 1".
- (Unicode Specification) ISO/IEC 10646, "Information technology - Universal Multiple-Octet Coded Character Set (UCS)".
- Unicode Consortium @ http://www.unicode.org.
Source: https://www3.ntu.edu.sg/home/ehchua/programming/java/datarepresentation.html