This is a purely informative rendering of an RFC that includes verified errata. This rendering may not be used as a reference.

The following 'Verified' errata have been incorporated in this document: EID 560

The following 'Verified' errata have been incorporated in this document: EID 560

Network Working Group R. Braden Request for Comments: 1071 ISI D. Borman Cray Research C. Partridge BBN Laboratories September 1988 Computing the Internet Checksum Status of This Memo This memo summarizes techniques and algorithms for efficiently computing the Internet checksum. It is not a standard, but a set of useful implementation techniques. Distribution of this memo is unlimited. 1. Introduction This memo discusses methods for efficiently computing the Internet checksum that is used by the standard Internet protocols IP, UDP, and TCP. An efficient checksum implementation is critical to good performance. As advances in implementation techniques streamline the rest of the protocol processing, the checksum computation becomes one of the limiting factors on TCP performance, for example. It is usually appropriate to carefully hand-craft the checksum routine, exploiting every machine-dependent trick possible; a fraction of a microsecond per TCP data byte can add up to a significant CPU time savings overall. In outline, the Internet checksum algorithm is very simple: (1) Adjacent octets to be checksummed are paired to form 16-bit integers, and the 1's complement sum of these 16-bit integers is formed. (2) To generate a checksum, the checksum field itself is cleared, the 16-bit 1's complement sum is computed over the octets concerned, and the 1's complement of this sum is placed in the checksum field. (3) To check a checksum, the 1's complement sum is computed over the same set of octets, including the checksum field. If the result is all 1 bits (-0 in 1's complement arithmetic), the check succeeds. Suppose a checksum is to be computed over the sequence of octets A, B, C, D, ... , Y, Z. Using the notation [a,b] for the 16-bit integer a*256+b, where a and b are bytes, then the 16-bit 1's complement sum of these bytes is given by one of the following: [A,B] +' [C,D] +' ... +' [Y,Z] [1] [A,B] +' [C,D] +' ... +' [Z,0] [2] where +' indicates 1's complement addition. These cases correspond to an even or odd count of bytes, respectively. On a 2's complement machine, the 1's complement sum must be computed by means of an "end around carry", i.e., any overflows from the most significant bits are added into the least significant bits. See the examples below. Section 2 explores the properties of this checksum that may be exploited to speed its calculation. Section 3 contains some numerical examples of the most important implementation techniques. Finally, Section 4 includes examples of specific algorithms for a variety of common CPU types. We are grateful to Van Jacobson and Charley Kline for their contribution of algorithms to this section. The properties of the Internet checksum were originally discussed by Bill Plummer in IEN-45, entitled "Checksum Function Design". Since IEN-45 has not been widely available, we include it as an extended appendix to this RFC. 2. Calculating the Checksum This simple checksum has a number of wonderful mathematical properties that may be exploited to speed its calculation, as we will now discuss. (A) Commutative and Associative As long as the even/odd assignment of bytes is respected, the sum can be done in any order, and it can be arbitrarily split into groups. For example, the sum [1] could be split into: ( [A,B] +' [C,D] +' ... +' [J,0] ) +' ( [0,K] +' ... +' [Y,Z] ) [3] (B) Byte Order Independence The sum of 16-bit integers can be computed in either byte order. Thus, if we calculate the swapped sum: [B,A] +' [D,C] +' ... +' [Z,Y] [4] the result is the same as [1], except the bytes are swapped in the sum! To see why this is so, observe that in both orders the carries are the same: from bit 15 to bit 0 and from bit 7 to bit 8. In other words, consistently swapping bytes simply rotates the bits within the sum, but does not affect their internal ordering. Therefore, the sum may be calculated in exactly the same way regardless of the byte order ("big-endian" or "little-endian") of the underlaying hardware. For example, assume a "little- endian" machine summing data that is stored in memory in network ("big-endian") order. Fetching each 16-bit word will swap bytes, resulting in the sum [4]; however, storing the result back into memory will swap the sum back into network byte order. Byte swapping may also be used explicitly to handle boundary alignment problems. For example, the second group in [3] can be calculated without concern to its odd/even origin, as: [K,L] +' ... +' [Z,0] if this sum is byte-swapped before it is added to the first group. See the example below. (C) Parallel Summation On machines that have word-sizes that are multiples of 16 bits, it is possible to develop even more efficient implementations. Because addition is associative, we do not have to sum the integers in the order they appear in the message. Instead we can add them in "parallel" by exploiting the larger word size. To compute the checksum in parallel, simply do a 1's complement addition of the message using the native word size of the machine. For example, on a 32-bit machine we can add 4 bytes at a time: [A,B,C,D]+'... When the sum has been computed, we "fold" the long sum into 16 bits by adding the 16-bit segments. Each 16-bit addition may produce new end-around carries that must be added. Furthermore, again the byte order does not matter; we could instead sum 32-bit words: [D,C,B,A]+'... or [B,A,D,C]+'... and then swap the bytes of the final 16-bit sum as necessary. See the examples below. Any permutation is allowed that collects all the even-numbered data bytes into one sum byte and the odd- numbered data bytes into the other sum byte. There are further coding techniques that can be exploited to speed up the checksum calculation. (1) Deferred Carries Depending upon the machine, it may be more efficient to defer adding end-around carries until the main summation loop is finished. One approach is to sum 16-bit words in a 32-bit accumulator, so the overflows build up in the high-order 16 bits. This approach typically avoids a carry-sensing instruction but requires twice as many additions as would adding 32-bit segments; which is faster depends upon the detailed hardware architecture. (2) Unwinding Loops To reduce the loop overhead, it is often useful to "unwind" the inner sum loop, replicating a series of addition commands within one loop traversal. This technique often provides significant savings, although it may complicate the logic of the program considerably. (3) Combine with Data Copying Like checksumming, copying data from one memory location to another involves per-byte overhead. In both cases, the bottleneck is essentially the memory bus, i.e., how fast the data can be fetched. On some machines (especially relatively slow and simple micro-computers), overhead can be significantly reduced by combining memory-to-memory copy and the checksumming, fetching the data only once for both. (4) Incremental Update Finally, one can sometimes avoid recomputing the entire checksum when one header field is updated. The best-known example is a gateway changing the TTL field in the IP header, but there are other examples (for example, when updating a source route). In these cases it is possible to update the checksum without scanning the message or datagram. To update the checksum, simply add the differences of the sixteen bit integers that have been changed. To see why this works, observe that every 16-bit integer has an additive inverse and that addition is associative. From this it follows that given the original value m, the new value m', and the old checksum C, the new checksum C' is: C' = C + (-m) + m' = C + (m' - m) 3. Numerical Examples We now present explicit examples of calculating a simple 1's complement sum on a 2's complement machine. The examples show the same sum calculated byte by bye, by 16-bits words in normal and swapped order, and 32 bits at a time in 3 different orders. All numbers are in hex. Byte-by-byte "Normal" Swapped Order Order Byte 0/1: 00 01 0001 0100 Byte 2/3: f2 03 f203 03f2 Byte 4/5: f4 f5 f4f5 f5f4 Byte 6/7: f6 f7 f6f7 f7f6 --- --- ----- ----- Sum1: 2dc 1f0 2ddf0 1f2dc dc f0 ddf0 f2dc Carrys: 1 2 2 1 -- -- ---- ---- Sum2: dd f2 ddf2 f2dd Final Swap: dd f2 ddf2 ddf2 Byte 0/1/2/3: 0001f203 010003f2 03f20100 Byte 4/5/6/7: f4f5f6f7 f5f4f7f6 f7f6f5f4 -------- -------- -------- Sum1: 0f4f7e8fa 0f6f4fbe8 0fbe8f6f4 Carries: 0 0 0 Top half: f4f7 f6f4 fbe8 Bottom half: e8fa fbe8 f6f4 ----- ----- ----- Sum2: 1ddf1 1f2dc 1f2dc ddf1 f2dc f2dc Carrys: 1 1 1 ---- ---- ---- Sum3: ddf2 f2dd f2dd Final Swap: ddf2 ddf2 ddf2 Finally, here an example of breaking the sum into two groups, with the second group starting on a odd boundary: Byte-by-byte Normal Order Byte 0/1: 00 01 0001 Byte 2/ : f2 (00) f200 --- --- ----- Sum1: f2 01 f201 Byte 4/5: 03 f4 03f4 Byte 6/7: f5 f6 f5f6 Byte 8/: f7 (00) f700 --- --- ----- Sum2: 1f0ea Sum2: f0ea Carry: 1 ----- Sum3: f0eb Sum1: f201 Sum3 byte swapped: ebf0 ----- Sum4: 1ddf1 Sum4: ddf1 Carry: 1 ----- Sum5: ddf2 4. Implementation Examples In this section we show examples of Internet checksum implementation algorithms that have been found to be efficient on a variety of CPU's. In each case, we show the core of the algorithm, without including environmental code (e.g., subroutine linkages) or speci