How computers (usually) store decimals
This page is a work in progress!
By Eldrick Chen, creator of calculusgaming.com
The computers we use every day have to do lots of calculations! At this very moment, the device you’re reading this on is most likely doing millions, maybe even billions of calculations every second. But in order for any computer to do calculations, it needs a way to store numbers.
Specifically, on this page, I’ll be focusing on how computers store decimals. But before we can understand this, we first need to learn about binary, the way computers are able to store any sort of number at all.
Here’s a very simple question: What is \(5 + 7\)? Type your answer here:
That’s correct! But what I’d like you to focus on is not the answer itself, but how you entered it. You had to type two digits to express the answer 12. But why exactly do we need two digits for the number 12, when numbers like 4 or 8 can be represented using only one digit?
The reason is that we use the decimal system to write numbers, which uses a base of 10. What this means is that we only have 10 unique digits: the digits 0 through 9. If we want to write a number greater than 9, we need to use more than one digit.
The way we express the integer after 9 is by resetting the rightmost digit to 0 and adding a 1 to the beginning to get the representation “10”. To keep counting past 10, we keep increasing the rightmost digit until we get to 19, then we once again reset the rightmost digit to 0 and increment the digit one place to its left to get “20”.
Once both the ones and the tens digit reach their maximum values of 9, we again reset both of these digits to 0 and add another 1 at the beginning (i.e. the integer after 99 is 100).
Now that we’ve reviewed how humans count, let’s move on to how computers count. Internally, computers don’t use 10 different digits; they use only 2: 0 and 1. This is because computer hardware is simpler to design when there are only 2 possible values for each digit.
Therefore, instead of the base 10 system that humans use, computers use binary, or base 2.
0 and 1 are represented the same way in binary as in decimal. But how is 2 represented in binary? Remember, computers don’t have a “2” digit to work with; they only have the digits 0 and 1.
Remember that when we ran out of digits in the decimal system (i.e. after we reached 9), we reset the rightmost digit to 0 and added a 1 to the front of the number to get 10. Binary does this same thing, but much earlier: the integer after 1 in binary is actually 10 (since we’re resetting the rightmost digit and adding a 1 on the front).
This means that 2 in decimal is actually written as “10” in the binary system! If we keep counting, we find that 3 in decimal is “11” in binary, and now we’re stuck again. We have two digits in our binary representation that are at their maximum possible values, so what do we do?
This is analogous to when we have the number 99 in decimal. When this happens, in order for us to keep counting, we reset both digits to 0 and add another 1 in front to get 100. In binary, we do the same thing: the binary integer after 11 (which is 3 in decimal) is 100 (or 4 in decimal), obtained by resetting both rightmost digits to 0 and adding another 1 in the front.
We can keep counting like this to obtain larger and larger binary numbers. Every time a digit exceeds 1, we reset it to 0 and increment the digit to the left of it. If we’re incrementing a binary number whose digits are all 1, we reset them all to 0 and add another 1 in the front.
Decimal | Binary |
---|---|
0 | 0 |
1 | 1 |
2 | 10 |
3 | 11 |
4 | 100 |
5 | 101 |
6 | 110 |
7 | 111 |
8 | 1000 |
9 | 1001 |
10 | 1010 |
When we write numbers in the base 10 decimal system, each digit has a place value: the amount each digit contributes to the overall value based on its location within the number. For example, in the number 52, the 2 is in the ones place and the 5 is in the tens place.
To find out what number “52” represents in decimal, we take each digit and multiply it by its place value. The “5” is in the 10s place and the “2” is in the 1s place, so this means that \(52 = \class{red}{5} \times \class{blue}{10} + \class{green}{2} \times \class{purple}{1}\).
Here is a table demonstrating this:
Digit | 5 | 2 |
---|---|---|
Place Value | 10s | 1s |
Value of Digit | \(5 \times 10\) | \(2 \times 1\) |
To find the value of this number, we add up the products on the last row (in this case \(5 \times 10 + 2 \times 1\)). So when we write the number “52”, what we really mean is 5 tens added to 2 ones, or \(5 \times 10 + 2 \times 1\).
Here’s a more complicated example for the number 2,417:
Digit | 2 | 4 | 1 | 7 |
---|---|---|---|---|
Place Value | 1,000s | 100s | 10s | 1s |
Value of Digit | \(2 \times 1{,}000\) | \(4 \times 100\) | \(1 \times 10\) | \(7 \times 1\) |
Summing up the products on the last row of the table, we get that the number 2,417 really means \(2 \times 1{,}000 + 4 \times 100 + 1 \times 10 + 7 \times 1\).
Notice the pattern with the place values: each time we move left one position, the place value is multiplied by 10. In addition, the rightmost digit represents ones. So the second digit from the right represents 10s, the third digit from the right represents 100s, and so on.
We can represent this with powers of 10. To get the value of a decimal number, we multiply the rightmost digit by \(1 = 10^0\), the second digit from the right by \(10 = 10^1\), the third digit from the right by \(100 = 10^2\), and so on. Here’s the table for the number 2,417 but with the place values written as powers of 10:
Digit | 2 | 4 | 1 | 7 |
---|---|---|---|---|
Place Value | \(10^3 = 1{,}000\) | \(10^2 = 100\) | \(10^1 = 10\) | \(10^0 = 1\) |
Value of Digit | \(2 \times 10^3\) | \(4 \times 10^2\) | \(1 \times 10^1\) | \(7 \times 10^0\) |
Using powers of 10, we can say that \(2{,}417 =\) \(2\times 10^3 + 4\times 10^2 + 1\times 10^1 + 7\times 10^0\).
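If you’d like to check this expansion in code, here’s a quick sketch in Python (my choice of language for the code examples on this page):

```python
# Rebuild 2,417 from its digits: each digit times its power-of-10 place value.
digits = [2, 4, 1, 7]
print(sum(d * 10 ** (len(digits) - 1 - i) for i, d in enumerate(digits)))  # 2417
```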
The way numbers are written in binary is similar to how they’re written in decimal, except that each time we move a place to the left, we multiply the place value by 2 instead of 10.
Here’s an example with the binary number 110100:
Digit | 1 | 1 | 0 | 1 | 0 | 0 |
---|---|---|---|---|---|---|
Place Value | 32s | 16s | 8s | 4s | 2s | 1s |
Value of Digit | \(1 \times 32\) | \(1 \times 16\) | \(0 \times 8\) | \(1 \times 4\) | \(0 \times 2\) | \(0 \times 1\) |
Notice how the rightmost place value is still ones, but every time we move a spot to the left, we multiply the place value by 2.
To get the value of this number, we add the values on the bottom row: \(1 \times 32 + 1 \times 16 + 0 \times 8\) \( + \: 1 \times 4 + 0 \times 2 + 0 \times 1 = 52\).
For binary specifically, there is a shortcut we can use: to get the value of a binary number, we add up the place values of all 1s in the number. In this case, we get \(32 + 16 + 4 = 52\).
We can also write this table with powers of 2:
Digit | 1 | 1 | 0 | 1 | 0 | 0 |
---|---|---|---|---|---|---|
Place Value | \(2^5\) | \(2^4\) | \(2^3\) | \(2^2\) | \(2^1\) | \(2^0\) |
Value of Digit | \(1 \times 2^5\) | \(1 \times 2^4\) | \(0 \times 2^3\) | \(1 \times 2^2\) | \(0 \times 2^1\) | \(0 \times 2^0\) |
We can once again use the trick of only adding up place values with a 1 in them to get that the value of this binary number is \(2^5 + 2^4 + 2^2 = 52\).
Using binary, we can write down any integer using only the two digits 0 and 1!
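Here’s a small Python sketch that converts both ways, including the “add up the place values of the 1s” shortcut from above:

```python
# Python's built-ins convert between decimal and binary directly.
print(bin(52))           # 0b110100 (the 0b prefix marks binary)
print(int("110100", 2))  # 52

# The shortcut by hand: add up the place values of all the 1 bits.
bits = "110100"
print(sum(2 ** i for i, b in enumerate(reversed(bits)) if b == "1"))  # 52
```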
Now that I’ve explained binary, we can get to how computers actually store numbers. Computers can store positive integers using their binary representation, but there is a caveat. Let’s walk through an example: how can a computer represent the number 52?
The number 52 in binary is 110100, so you might think that a computer stores the number 52 as “110100”. However, it’s actually not that simple, because computers like to work in groups of 8 binary digits (also called “bits”). A group of 8 bits is known as a byte.
When a computer stores a number, it has to dedicate a certain amount of memory to storing this number. These are the typical choices:

- 8 bits (1 byte)
- 16 bits (2 bytes)
- 32 bits (4 bytes)
- 64 bits (8 bytes)
Let’s say that the computer uses one byte to store the number 52. The actual binary representation of 52 is 110100, which is only 6 bits long, so we need to add two leading zeros to get the actual representation in memory: 00110100.
What about the number 500? In binary, this number is 111110100. However, note that this binary representation is 9 bits long, so it won’t fit in a single 8-bit byte. Therefore, if we want to store the entire number, we have to use at least 2 bytes. If we use 2 bytes, we can write this number as 00000001 11110100.
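In Python, you can see these exact byte groupings using int.to_bytes; here’s a quick sketch:

```python
# 52 fits in one byte; 500 needs two. "big" puts the most significant byte first.
for number, num_bytes in [(52, 1), (500, 2)]:
    raw = number.to_bytes(num_bytes, "big")
    print(" ".join(f"{b:08b}" for b in raw))
# 00110100
# 00000001 11110100
```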
Storing numbers like this is common on computers, and a number stored in this way is known as an unsigned integer. With a bit of modification (that I won’t get into, because after all this is a page about floating-point numbers), we can create a signed integer system: a number format that can store positive as well as negative integers.
So it is fairly straightforward to store integers in computers, but how can we store decimals? Here’s one idea:
Just like how we can write decimals in the decimal system (such as 0.3 and 2.419), we can also write decimals in binary. In decimal, every time we move one position to the right, the place value is divided by 10:
Digit | 2 | . | 4 | 1 | 9 |
---|---|---|---|---|---|
Place Value | \(10^0 = 1\) | | \(10^{-1} = 0.1\) | \(10^{-2} = 0.01\) | \(10^{-3} = 0.001\) |
Value of Digit | \(2 \times 10^0\) | | \(4 \times 10^{-1}\) | \(1 \times 10^{-2}\) | \(9 \times 10^{-3}\) |
We can do something similar in binary. For example, here’s the number 1.0101 in binary:
Digit | 1 | . | 0 | 1 | 0 | 1 |
---|---|---|---|---|---|---|
Place Value | \(2^0 = 1\) | | \(2^{-1} = 0.5\) | \(2^{-2} = 0.25\) | \(2^{-3} = 0.125\) | \(2^{-4} = 0.0625\) |
Value of Digit | \(1 \times 2^0\) | | \(0 \times 2^{-1}\) | \(1 \times 2^{-2}\) | \(0 \times 2^{-3}\) | \(1 \times 2^{-4}\) |
The value of this binary number is \(1 + 0.25 + 0.0625 = 1.3125\).
Because we can store decimals in binary, one idea for how to represent decimals in computers is to simply use their binary representations. For example, if we wanted to store 1.3125 on a computer, we could simply store “1.0101”.
However, there is a problem with this. How do we store the decimal point? Remember that data in computers is made of purely 0s and 1s, so we can’t just store a decimal point directly, nor can we use a bit to represent it (since we’re already using the bits 0 and 1 to represent the value of our number and we don’t have any other digits available).
One idea is to just implicitly store the decimal point at a certain location. For example, we can dedicate a byte of memory to store the 8 bits before the decimal point, and another byte of memory to store the 8 bits after the decimal point. This gives us a 16-bit data type that allows us to store decimals.
Using this data type, known as a fixed-point number (because the decimal point is fixed at a certain location), we can store the number 1.3125 as 00000001 01010000 (remember, there is an implicit decimal point between the two bytes). The first byte 00000001 represents the integer part of the number (in this case 1), and the second byte 01010000 represents the decimal part of the number (in this case 0.3125). The four trailing zeros in the second byte are there to pad the decimal part out to 8 bits.
Enter a fixed-point number in binary to get its decimal representation:
Enter a number in decimal to get its binary fixed-point representation:
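If you’d rather experiment in code than with the converters above, here’s a minimal sketch of a 16-bit fixed-point format in Python (8 integer bits, 8 fractional bits, matching the example above); the function names are my own:

```python
FRACTION_BITS = 8  # 8 bits after the implicit binary point

def to_fixed(x: float) -> int:
    """Encode x as a 16-bit fixed-point number: multiply by 2**8 and round."""
    return round(x * 2**FRACTION_BITS)

def from_fixed(raw: int) -> float:
    """Decode a fixed-point number back to an ordinary float."""
    return raw / 2**FRACTION_BITS

raw = to_fixed(1.3125)
print(f"{raw:016b}")    # 0000000101010000 -> 00000001.01010000 with the point restored
print(from_fixed(raw))  # 1.3125

# Numbers smaller than half the gap between representable values vanish:
print(to_fixed(0.001))  # 0
```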
This works decently well, but fixed-point numbers still have a few weaknesses. For example, suppose we use 32 bits to represent a fixed-point number (a very common size for numbers in computers), dedicating 16 bits to the integer part and 16 bits to the fractional part. What is the largest number we can represent?
The largest number we can represent with this format is achieved by setting all of the bits to 1. This number is 11111111 11111111 11111111 11111111 in our fixed-point format, or about 65,535.9999847 in decimal. For comparison, the largest number we can represent with a 32-bit unsigned integer is 4,294,967,295. It would be ideal if we could store larger numbers, like 1,000,000, while still only using 32 bits.
In addition, what about very small numbers? The smallest positive number our fixed-point format can store is 00000000 00000000 00000000 00000001, or about 0.0000153. It would also be nice if we could represent even smaller numbers in our format. (Try entering 0.000001 into the decimal-to-fixed-point converter: what do you get?)
It turns out that there is a way for us to fix both of these problems, and the solution is known as floating-point numbers!
The main cause of the aforementioned problems is that the gap between representable values is constant. For example, the two smallest positive numbers we can represent with our 32-bit fixed-point format are about 0.0000153 and 0.0000305. The gap between these two numbers is about 0.0000153.
The two largest numbers we can represent are about 65,535.9999847 and 65,535.9999694. Once again, the gap between these two numbers is about 0.0000153.
But the gap between consecutive numbers doesn’t have to be constant. What if we designed a system where the gap scales based on the size of our number (i.e. we made the gap between small numbers very small and the gap between large numbers larger)?
This is how the floating-point system works. However, the floating-point system is a bit complicated to explain at first, so I’ll start with an analogous system in decimal.
Let’s start with a riddle. Here are the rules:

- You’re completing a math assignment for me, and the answers can be very large numbers.
- To earn extra credit, you must write every answer using at most 7 digits.
- Your answers only need to be accurate to 3 significant figures.
Can you think of a strategy you can use to earn the extra credit on my assignment? Typically, with 7 digits, the only non-negative integers you can represent are 0 to 9,999,999. So how can you represent numbers larger than this with only 7 digits?
Ideally, this strategy should be as simple as possible, meaning you shouldn’t have to do lots of complicated calculations just to convert each answer into a form that only uses 7 digits or less. You should also try to avoid hacky workarounds such as replacing every digit with a symbol. (Hint: if you’ve taken a high school science class before, you might already know the answer!)
A simple solution is to use scientific notation: a compact way of writing large and small numbers.
To understand the idea behind scientific notation, let’s say you had to write the number 1 trillion. Normally, you would write that as 1,000,000,000,000, but that takes up a lot of space, and you’re repeating the digit 0 a lot. Instead, you could just write that as “1 followed by 12 zeros”, or even more concisely as \(10^{12}\). If we instead wanted to write 5 trillion, we could write that as \(5 \times 10^{12}\).
This is the idea behind scientific notation: using powers of 10 to shorten how we write large or small numbers. For example, the number 1 followed by 100 zeros (also known as a googol) can be written simply as \(10^{100}\).
A number in scientific notation has three parts: a sign, a coefficient (sometimes called “mantissa”), and an exponent. For example, the number 5 trillion written in scientific notation is \(\class{red}{5} \times 10^\class{blue}{12}\). Here, the sign is positive, the coefficient is 5, and the exponent is 12.
Typically when we write numbers in scientific notation, we want the coefficient to be in between 1 and 10 (including 1 but excluding 10). For example, 50 trillion could be written as \(50 \times 10^{12}\) or \(5 \times 10^{13}\), but you’ll see the latter more often.
There are two benefits of using this coefficient restriction:

- Every number has exactly one representation (for example, 50 trillion is always \(5 \times 10^{13}\) rather than \(50 \times 10^{12}\) or \(0.5 \times 10^{14}\)).
- The exponent alone tells you the number’s order of magnitude, making it easy to compare the sizes of two numbers at a glance.
Here are some more examples of scientific notation:
Number | Scientific Notation |
---|---|
12.3 | \(1.23 \times 10^1\) |
24,180,000 | \(2.418 \times 10^7\) |
0.00064 | \(6.4 \times 10^{-4}\) |
\(2^{100}\) | \(\approx 1.2677 \times 10^{30}\) |
100 factorial \(= 100 \times 99 \times \cdots \times 2 \times 1\) | \(\approx 9.3326 \times 10^{157}\) |
Speed of light (in meters per second) | \(\approx 3 \times 10^8\) |
Enter a number to see it displayed in scientific notation:
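For a code version of this converter, Python’s “e” format displays any number in scientific notation:

```python
# The "e" format code prints a number in scientific notation.
for x in [12.3, 24_180_000, 0.00064, 2**100]:
    print(f"{x:.4e}")
# 1.2300e+01
# 2.4180e+07
# 6.4000e-04
# 1.2677e+30
```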
This is the solution to our puzzle: since we only need 3 significant figures for our answers, using scientific notation allows us to write any number up to \(9.99 \times 10^{99}\) with only 7 digits! For example, the number 123,000,000 can be written as \(1.23 \times 10^8\), and the number 78,900,000,000 can be written as \(7.89 \times 10^{10}\). Notice how we only need 7 digits or less to write these numbers in scientific notation.
The name “floating-point” comes from the fact that in scientific notation, the coefficient represents the first few digits of a number and the exponent tells you where to place the decimal point within those digits (i.e. by changing the exponent, the decimal point can “float” between digits). For example, for the number \(56.78 = 5.678 \times 10^1\), the coefficient 5.678 tells you the digits of the number and the exponent 1 tells you to shift the decimal point in the coefficient one position to the right to get 56.78.
This puzzle might feel arbitrary (what teacher would limit you to only using 7 digits on your answers?), but this is very similar to the struggles that computer designers and programmers had to face. Remember that we only have a certain number of bits (such as 32 or 64) to store each number in computer memory. So how can we most efficiently use this limited number of bits to store a wide range of numbers?
The solution is to use scientific notation, but in binary. Because we’re working in binary, it’s much easier to use 2 as the base of our scientific notation representation instead of 10. For example, the number 65,536 would be written as \(2^{16}\), and the number 98,304 would be written as \(1.5 \times 2^{16}\).
Remember that a number in scientific notation has three parts:

- A sign
- A coefficient
- An exponent
(Technically the base we’re using (in this case 2) is also part of each number in scientific notation, but since computers always store numbers in binary, this base will never change. So we don’t actually need to store the 2 as part of every number.)
How can we store these parts in a single binary number? Let’s go through each component one by one.
The sign is the most straightforward part: a nonzero real number is either positive or negative (we’ll worry about zero later). Since there are only two possibilities for the sign of a number, we can store it with a single bit: 0 for positive and 1 for negative (matching the convention that negative signed integers typically start with a 1).
The exponent needs multiple bits (if we used only one bit, we could only store exponents of 0 and 1, which isn’t very useful). In addition, we’ll need a way to store negative exponents, for very small numbers such as \(0.0001 = 1.6384 \times 2^{-14}\).
Remember how I said that in decimal scientific notation, we ideally want the coefficient to be in between 1 and 10? With binary scientific notation, we ideally want the coefficient to be in between 1 and 2 (including 1 and excluding 2). Let’s look at a few examples of binary scientific notation:
Number (in decimal) | Scientific Notation |
---|---|
3 | \(1.5 \times 2^1\) |
1,000 | \(\approx 1.9531 \times 2^9\) |
12,345,678 | \(\approx 1.4717 \times 2^{23}\) |
0.002 | \(1.024 \times 2^{-9}\) |
We already have a way to store decimals in binary, so we can use this to store the coefficient!
Before we do that though, do you notice something about the coefficients in the above table? I said that the coefficient should always be in between 1 and 2, and so because of that, the first digit of the coefficient is always 1.
This means that we don’t actually need to bother with storing this 1 in the first place, since we’ll already know that it’s there! (There is one exception to this, and that’s with the number zero, but we’ll worry about that later.)
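Here’s a short Python sketch that finds the coefficient and exponent of a positive number in binary scientific notation, using math.frexp (the helper name is my own):

```python
import math

# math.frexp(x) returns (m, e) with x == m * 2**e and 0.5 <= m < 1.
# Doubling m and lowering e by 1 gives the usual 1 <= coefficient < 2 form.
def binary_scientific(x: float) -> tuple[float, int]:
    m, e = math.frexp(x)  # assumes x > 0
    return m * 2, e - 1

print(binary_scientific(3.0))     # (1.5, 1)       -> 1.5 * 2**1
print(binary_scientific(1000.0))  # (1.953125, 9)  -> about 1.9531 * 2**9
print(binary_scientific(0.002))   # (1.024, -9)    -> 1.024 * 2**-9
```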
Now that we’ve discussed the three parts of a binary floating-point number, it’s time to dive into the details of how these parts are actually stored on a computer. For example, how many bits do we use for the coefficient and exponent? There are multiple different standards, but the most widely used standard is known as IEEE 754 (because the standard was created by the Institute of Electrical and Electronics Engineers, or IEEE).
The IEEE 754 standard is how most computers store floating-point numbers (also known as floats). Remember that numbers stored on a computer might take up different amounts of memory. The most common amounts for floats are 32 bits (4 bytes) and 64 bits (8 bytes). There are different standards for both of these sizes, but they work very similarly.
The IEEE 754 standard specifies how many bits are to be used for each of the parts of a floating-point number. For 32-bit floats, the distribution is:

- 1 bit for the sign
- 8 bits for the exponent
- 23 bits for the coefficient

For 64-bit floats, these amounts are used:

- 1 bit for the sign
- 11 bits for the exponent
- 52 bits for the coefficient
Let’s go through each of these components.
As mentioned before, the sign bit tells us if a number is positive or negative. If the sign bit is 0, the number is positive, and if the sign bit is 1, the number is negative.
Let’s focus on just the 32-bit standard for now. 8 bits are used for the exponent, and with 8 bits, we can store integers from 0 to 255.
However, we also need a way to store negative exponents. There are a few ways to store negative integers in binary, but what the IEEE 754 standard does is take the raw exponent (the binary number represented by the 8 exponent bits) and subtract 127 (this 127 is known as the exponent bias).
As an example, if the exponent bits are 01000000 (which would normally represent 64), the exponent is actually \(64 - 127 = -63\).
By using this biased exponent format, the range of exponents we can store becomes -127 to 128 instead of 0 to 255.
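Here’s a sketch using Python’s struct module that splits a number’s 32-bit float representation into its three fields and undoes the bias (the helper name is my own):

```python
import struct

def float32_fields(x: float) -> tuple[int, int, int]:
    """Split a 32-bit float into its sign, raw exponent, and coefficient bits."""
    bits = int.from_bytes(struct.pack(">f", x), "big")  # the 4 bytes as one integer
    sign = bits >> 31                   # top bit
    raw_exponent = (bits >> 23) & 0xFF  # next 8 bits
    coefficient = bits & 0x7FFFFF       # bottom 23 bits
    return sign, raw_exponent, coefficient

sign, raw_exponent, coefficient = float32_fields(0.25)
print(sign)                   # 0 (positive)
print(raw_exponent - 127)     # -2, since 0.25 = 1.0 * 2**-2 (bias of 127 subtracted)
print(f"{coefficient:023b}")  # 00000000000000000000000 (the leading 1 isn't stored)
```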
For 32-bit floats, IEEE 754 gives us 23 bits to store the coefficient. As mentioned previously, the coefficient in binary scientific notation will almost always start with a 1, so we don’t bother storing that 1. For example, if the coefficient is 1.5 (which in binary is 1.1), we only have to store the decimal part of the coefficient, which is “.1”. In addition, we don’t store the decimal point and we need to add trailing zeros until we have 23 bits, so the actual representation of a coefficient of 1.5 is 10000000000000000000000.
However, because we only have 23 bits, not every decimal can be stored exactly. For example, 1.1 in decimal is written as 1.0001100110011001100... in binary (the 1100 pattern repeats forever). When we encounter a coefficient like this, we must round it to 23 bits in order for it to fit into our floating-point number. So the coefficient 1.1 is stored as 00011001100110011001101 (the last 1 is there because of rounding).
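You can generate these binary digits yourself with the classic doubling trick; here’s a short Python sketch:

```python
# Doubling a fraction shifts its binary digits one place to the left,
# so the integer part that pops out is the next binary digit.
def fraction_bits(x: float, n: int) -> str:
    frac = x - int(x)
    digits = []
    for _ in range(n):
        frac *= 2
        digits.append("1" if frac >= 1 else "0")
        frac -= int(frac)
    return "".join(digits)

print(fraction_bits(1.1, 24))  # 000110011001100110011001 (the 1100 pattern repeats)
```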
Let’s say we want to store the number \(\pi = 3.141592...\) as a 32-bit floating-point number. The first step is to write the number in binary scientific notation in a way such that the coefficient is in between 1 and 2. In this case, \(\pi = 1.570796... \times 2^1\). Now let’s break down this number into its parts: the sign, exponent, and coefficient.
The sign bit is the most straightforward. \(\pi\) is a positive number, so we set the sign bit to 0.
The exponent in this case is 1. However, remember that floats use a biased exponent, so we need to correct for this bias. To do this, we add 127 to the exponent for a value of 128. In binary, this number is stored as 10000000. (If we had a binary number shorter than 8 bits, we would need to add leading zeros until we had 8 bits. In this case, we don’t need to do that.)
The coefficient in our case is \(\pi/2 = 1.570796...\), whose binary expansion never ends. However, because we only have 23 bits for the coefficient, there’s no way for us to store the infinitely many digits of this coefficient.
So what we have to do is use the available coefficient that is closest to the true coefficient of \(\pi/2\). That coefficient is \(13{,}176{,}795/2^{23} \approx 1.57079637\). Now we need to write that coefficient in binary, which is 1.10010010000111111011011. Remember that we drop the leading 1 when writing the coefficient, so our final coefficient bits are 10010010000111111011011. (If our binary representation of the coefficient has fewer than 23 bits after the point, we would need to add trailing zeros until we have 23 bits.)
These are the binary representations of each part we’ve found so far:

- Sign: 0
- Exponent: 10000000
- Coefficient: 10010010000111111011011
To turn this into a floating-point number, we simply join all the parts together. Therefore, \(\pi\) as a 32-bit floating-point number is 01000000010010010000111111011011.
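We can check this against Python, which can pack a number into the 4 bytes of a 32-bit float:

```python
import struct

# Pack pi as a 32-bit float and print all 32 bits.
bits = int.from_bytes(struct.pack(">f", 3.141592653589793), "big")
print(f"{bits:032b}")  # 01000000010010010000111111011011
```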
Now let’s take the float we just obtained, 01000000010010010000111111011011, and convert it back to decimal. The sign is positive, the exponent is 1, and the coefficient is 1.57079637..., so our number is \(1.57079637... \times 2^1 = 3.14159274...\) However, if we look at the actual value of \(\pi\), we get \(3.14159265...\), slightly off from our floating-point value. What happened?
The problem is that because we only have 23 coefficient bits, we had to do some rounding in order to convert our number into a valid float. This means that floats are limited in how precise their values can be.
This can cause problems in some cases. For example, if you typed `0.1 + 0.2` in most programming languages, what would you expect the output to be?

If you guessed `0.3`, you would be wrong. What you actually get is `0.30000000000000004`. This is because the numbers 0.1 and 0.2 can’t be stored exactly in the floating-point format. When you store 0.1 into a 64-bit floating-point number, the actual number stored is very slightly greater than 0.1, and the same is true for 0.2. So when you add the two numbers together, this error is enough for the computer to give the wrong answer!
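Python’s decimal module can show the exact value a float really stores, which makes the error visible; a quick demonstration:

```python
from decimal import Decimal

# Decimal(x) prints the exact binary value stored for the float x.
print(Decimal(0.1))      # 0.1000000000000000055511151231257827021181583404541015625
print(Decimal(0.2))      # 0.200000000000000011102230246251565404236316680908203125
print(0.1 + 0.2)         # 0.30000000000000004
print(0.1 + 0.2 == 0.3)  # False
```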
There is still a problem with this implementation of floating-point. Remember how I said that the coefficient always has a leading 1 that isn’t actually stored? This means that every number we store in this format must have a coefficient between 1 and 2, and this makes it impossible to store zero.
To fix this, we can use the smallest exponent available, -127, to just mean zero (i.e. if a float has an exponent of -127, interpret the float as having a value of zero instead).
However, doing this means that you now have \(2^{24}\) different ways to write zero, since there are \(2^{24}\) floating point numbers with an exponent of -127 (since there are 23 coefficient bits and 1 sign bit). It would be nice if we could use these \(2^{24}\) numbers to not only store zero, but also store very small numbers.
This is where subnormal numbers come in. The way subnormal numbers work is that if the exponent is at the lowest possible value of -127, we actually add one to the exponent and ignore the implied leading 1 in the coefficient. This means that the lowest possible exponent is actually -126, and all subnormal numbers have this exponent.
For example, if we wanted to store the number \(10^{-40} \approx 1.0889 \times 2^{-133}\), we would need to write it with an exponent of -126 to get \(0.008507... \times 2^{-126}\). Now we need to write the coefficient 0.008507... in binary to get 0.00000010001011011000010. We only store the decimal part of the coefficient, so our final coefficient is 00000010001011011000010.
With subnormal numbers, because the exponent is the smallest possible exponent, it is stored in binary as 00000000.
Therefore, the 32-bit floating-point number for \(10^{-40}\) is 00000000000000010001011011000010.
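Again we can check with Python’s struct module; packing \(10^{-40}\) as a 32-bit float should reproduce this subnormal bit pattern:

```python
import struct

# 1e-40 is below the smallest normal float32, so it's stored as a subnormal.
bits = int.from_bytes(struct.pack(">f", 1e-40), "big")
print(f"{bits:032b}")  # 00000000000000010001011011000010
```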
To store zero as a floating-point number, we simply set all of the bits to zero, so the floating-point representation of zero is simply 00000000000000000000000000000000.
Interestingly, the IEEE 754 standard has a representation for “negative zero”, obtained by setting the sign bit to 1 and all other bits to 0. 10000000000000000000000000000000 represents negative zero.
Note that if a floating-point number is too small to be stored, it will be rounded down to zero. For example, there is no way to store \(10^{-50}\) in a 32-bit floating-point number (the smallest representable positive number is about \(1.4 \times 10^{-45}\)), so trying to store \(10^{-50}\) in a 32-bit float will simply result in zero.
Another problem occurs with very large numbers. Currently, the largest number we can store is found by setting all but the first bit to 1: the floating-point number 01111111111111111111111111111111 is equal to about \(\class{blue}{1.99999988} \times 2^\class{purple}{128} \approx 6.8 \times 10^{38}\). However, what happens if we calculate a number that’s larger than that, such as \(2^{500} \approx 3.27 \times 10^{150}\)?
One possible solution is to just set the floating-point number to its largest possible value whenever the result of a calculation is too large, but this can lead to very misleading results. For example, if we said that \(2^{500}\) is equal to the maximum possible float \(6.8 \times 10^{38}\), our answer would be off by over 100 orders of magnitude!
A better solution is to have a special value that indicates that the result of our calculation is too large to be stored in a float. In the IEEE 754 standard, this value is known as “Infinity”, and is stored by setting all of the exponent bits to 1 and all of the coefficient bits to 0. Therefore, the float 01111111100000000000000000000000 represents infinity.
Likewise, negative infinity can be stored by setting the sign bit to 1: 11111111100000000000000000000000 represents negative infinity.
However, “infinity” in floating-point is not the same as infinity in math. It simply represents a number that is too large to be stored in a floating-point number. There are some similarities between infinity in math and infinity in floating-point (for example, infinity in floating-point is larger than any other floating-point number, just like how infinity is larger than any real number in math), but just because a calculation returns the floating-point number infinity doesn’t mean that the result is actually infinite.
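You can see this behavior in Python, whose floats are 64-bit (so the overflow threshold is near \(1.8 \times 10^{308}\) instead):

```python
import math

# Arithmetic that overflows the largest 64-bit float produces inf.
print(1e308 * 10)          # inf
print(1e308 * 10 > 1e308)  # True: infinity compares larger than any finite float
print(math.inf)            # inf (the same special value, available as a constant)
```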
What happens when we perform an invalid operation, such as trying to divide 0 by 0? With integer arithmetic, a common result is that the program would simply crash or a meaningless value would be returned.
But what happens in floating-point? It turns out the IEEE 754 standard has another special value for situations like these: Not a Number, often abbreviated to NaN.
NaN is stored by setting all the exponent bits to 1 and the coefficient bits to anything that isn’t 0. For example, 01111111100000000000000000000001 is one possible way to store NaN.
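Here’s a small Python demonstration (Python raises an exception for `0.0 / 0.0` rather than returning NaN, so we create the value directly):

```python
import math

nan = float("nan")       # one way to produce the special NaN value
print(nan)               # nan
print(nan == nan)        # False: NaN compares unequal to everything, even itself
print(math.isnan(nan))   # True: the reliable way to test for NaN
```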
Enter a floating-point number in binary to get its decimal representation: