A Summary of Jon Louis Bentley's
``Writing Efficient Programs''

Lawrence A. Crowl
29 August 1990

This summary of Jon Louis Bentley's book ``Writing Efficient Programs'' consists of selected text from that book with editorial changes, primarily removing references to specific examples.

Methodology

The following steps are essential parts of a methodology of building efficient computing systems.
  1. The most important properties of a large system are a clean design and implementation, useful documentation, and a maintainable modularity. The first steps in the programming process should therefore be the design of a solid system and the clean implementation of that design.
  2. If the overall system performance is not satisfactory, then the programmer should monitor the program to identify where the scarce resources are being consumed. This usually reveals that most of the time is used by a few percent of the code.
  3. Proper data structure selection and algorithm design are often the key to large reductions in the running time of the expensive parts of the program. The programmer should therefore try to revise the data structures and algorithms in the critical modules of the system.
  4. If the performance of the critical parts is still unsatisfactory, then use the techniques below to recode them. The original code should usually be left in the program as documentation.
  5. If additional speed is still needed, then the programmer should work at lower design levels, including hand-written assembly code, operating system modifications, microcode, and special-purpose hardware design.

Applying the Rules

Once the hot spots of a system have been isolated, there are four critical steps in applying the following rules.
Identify the code to be changed.
We should identify the code to be changed by monitoring the program, and then concentrate on the parts that are taking the majority of the time.
Choose a rule and apply it.
Once we know what code to change, we see if any of our rules can be used to change it. The rules are presented in groups as they relate to different parts of programs; we should therefore identify whether the expense of our program is going to data structures, loops, logic, procedures or expressions, and then search in the appropriate list for a candidate rule. When we apply a rule, we should make sure that the application preserves the correctness of the program; this is usually done by applying the spirit, if not the actual formulas, of program verification.
Measure the effect of the modification.
Removing a common subexpression in a traveling salesman tour program is typical of many changes we make: it appears that it should double the program's speed, but in fact it gives less than a three percent improvement. Even if we believe that we understand the effect of a transformation by reasoning alone, it is absolutely necessary to support that analysis with observation; we often find that we are quite mistaken.
Document the resulting program.
The final modified program should include a description of the clean code and of the modification that was incorporated for efficiency. That description can range from a brief comment to including a copy of the original code enclosed within comment characters and a thorough discussion of the rule used to modify it (with an appropriate reference to this book, of course).

Fundamental Rules

These fundamental rules underlie any application of the more detailed rules below.
Code Simplification:
Most fast programs are simple. Therefore, keep code simple to make it faster.
Problem Simplification:
To increase the efficiency of a program, simplify the problem it solves.
Relentless Suspicion:
Question the necessity of each instruction in a time-critical piece of code and each field in a space-critical data structure.
Early Binding:
Move work to an earlier time. Specifically, do work now, just once, in the hope of avoiding doing it many times later.

Space for Time Rules

Introducing redundant information can decrease run time at the cost of increasing space used. (The complements of these rules may apply.)
Data Structure Augmentation:
The time required for common operations on data can often be reduced by augmenting the structure with extra information or by changing the information within the structure so that it can be accessed more easily. Examples are reference counters and hints.
Store Precomputed Results:
The cost of recomputing an expensive function can be reduced by computing the function only once and storing the results. Subsequent requests for the function are then handled by table lookup rather than by recomputing the function. (A sketch appears after this list.)
Caching:
Data that is accessed most often should be the cheapest to access. (Caching can ``backfire'' and increase the run time of a program if locality is not present in the underlying data; a sketch appears after this list.)
Lazy Evaluation:
The strategy of never evaluating an item until it is needed avoids evaluations of unnecessary items.
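The Store Precomputed Results rule above might be applied as in the following minimal C sketch (the book does not use C; the table, function names, and byte-counting task here are illustrative). The number of one bits in each byte value is computed once and stored; every later request is a single table lookup.

  #include <stdio.h>

  static int bit_count[256];            /* the stored results */

  /* Fill the table once, before any lookups are made. */
  static void init_bit_count(void)
  {
      for (int v = 0; v < 256; v++) {
          int n = 0;
          for (int b = v; b != 0; b >>= 1)
              n += b & 1;
          bit_count[v] = n;
      }
  }

  /* Subsequent requests are handled by table lookup. */
  static int ones_in_byte(unsigned char v)
  {
      return bit_count[v];
  }

  int main(void)
  {
      init_bit_count();
      printf("%d\n", ones_in_byte(0xB7));   /* prints 6 */
      return 0;
  }
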
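The Caching rule can be as simple as remembering the most recent argument and result of an expensive function, as in this hypothetical C sketch (the function itself is only a stand-in). It pays off only when successive calls tend to repeat the same argument; without that locality, the cache merely adds a comparison to every call.

  #include <math.h>
  #include <stdio.h>

  /* A stand-in for some expensive computation of one argument. */
  static double expensive(double x)
  {
      return exp(sin(x)) + log(1.0 + x * x);
  }

  /* A one-entry cache in front of the expensive function. */
  static double cached_expensive(double x)
  {
      static int    valid = 0;
      static double last_x, last_result;

      if (!valid || x != last_x) {
          last_x = x;
          last_result = expensive(x);
          valid = 1;
      }
      return last_result;
  }

  int main(void)
  {
      printf("%f\n", cached_expensive(2.0));   /* computed */
      printf("%f\n", cached_expensive(2.0));   /* served from the cache */
      return 0;
  }
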

Time for Space Rules

Recomputing information can decrease a program's space requirements, at the cost of increased execution time. (The complements of these rules may apply.)
Packing:
Dense storage representations can decrease storage costs, at the cost of increasing the time required to store and retrieve data. (Decreasing data space may increase code space.) A sketch appears after this list.
Overlaying:
Storing data items that are never simultaneously active in the same memory space reduces data space. Code overlaying reduces code space by using the same storage for routines that are never simultaneously needed. Many operating systems provide this service automatically in their virtual memory systems.
Interpreters:
The space required to represent a program can often be decreased by the use of interpreters in which common sequences of operations are represented compactly. (Finite state machines are simple, compact interpreters.)
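The Packing rule might look like the following C sketch (the date encoding and field widths are illustrative, not from the book). A year/month/day date is stored in one 16-bit word instead of three integers; each access now costs a few shifts and masks, but three fields fit in two bytes.

  #include <stdint.h>
  #include <stdio.h>

  /* Day in 5 bits, month in 4 bits, year offset from 1900 in 7 bits. */
  static uint16_t pack_date(int year, int month, int day)
  {
      return (uint16_t)(((year - 1900) << 9) | (month << 5) | day);
  }

  static void unpack_date(uint16_t d, int *year, int *month, int *day)
  {
      *year  = (d >> 9) + 1900;
      *month = (d >> 5) & 0x0F;
      *day   = d & 0x1F;
  }

  int main(void)
  {
      int y, m, dd;
      uint16_t d = pack_date(1990, 8, 29);
      unpack_date(d, &y, &m, &dd);
      printf("%d-%d-%d\n", y, m, dd);   /* prints 1990-8-29 */
      return 0;
  }
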

Loop Rules

The efficiency rules used most frequently deal with loops for the simple reason that the hot spots in most programs involve loops.
Code Motion Out of Loops:
Instead of performing a certain computation in each iteration of a loop, it is better to perform it only once, outside the loop. (Code cannot be moved out of loops if it has side effects that are desired on every iteration.)
Combining Tests:
An efficient inner loop should contain as few tests as possible, and preferably only one. The programmer should therefore try to simulate some of the exit conditions of the loop by other exit conditions. Sentinels are a common application of this rule: we place a sentinel at the boundary of a data structure to reduce the cost of testing whether our search has exhausted the structure. (A sentinel search is sketched after this list.)
Loop Unrolling:
A large cost of some short loops is in modifying the loop indices. That cost can often be reduced by unrolling the loop. (Unrolling is sketched after this list.)
Transfer-Driven Loop Unrolling:
If a large cost of an inner loop is devoted to trivial assignments, then those assignments can often be removed by repeating the code and changing the use of variables. Specifically, to remove the assignment I:=J, the subsequent code must treat J as though it were I.
Unconditional Branch Removal:
A fast loop should contain no unconditional branches. An unconditional branch at the end of a loop can be removed by ``rotating'' the loop to have a conditional branch at the bottom.
Loop Fusion:
If two nearby loops operate on the same set of elements, then combine their operational parts and use only one set of loop control operations.
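The sentinel application of the Combining Tests rule might be sketched in C as follows (the array contents and names are illustrative). Storing the sought value just past the last element guarantees that the scan terminates, so the inner loop performs one test per element instead of two.

  #include <stdio.h>

  /* Search x[0..n-1] for t; x must have room for one extra element.
     Placing t at x[n] lets the loop test only the element, not the index. */
  static int search(int x[], int n, int t)
  {
      int i = 0;
      x[n] = t;                  /* the sentinel */
      while (x[i] != t)
          i++;
      return i < n ? i : -1;     /* -1 means not found */
  }

  int main(void)
  {
      int x[8 + 1] = { 4, 8, 15, 16, 23, 42, 7, 3 };   /* one spare slot */
      printf("%d\n", search(x, 8, 23));   /* prints 4 */
      printf("%d\n", search(x, 8, 99));   /* prints -1 */
      return 0;
  }
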
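Loop Unrolling might look like this C sketch (the factor of four is illustrative, and a good optimizing compiler may already do this). Four elements are summed per iteration, so the index is tested and incremented once per four additions; a cleanup loop handles the leftover elements.

  #include <stdio.h>

  /* Sum n doubles, four elements per iteration of the main loop. */
  static double sum(const double x[], int n)
  {
      double s = 0.0;
      int i = 0;
      for (; i + 4 <= n; i += 4)
          s += x[i] + x[i + 1] + x[i + 2] + x[i + 3];
      for (; i < n; i++)         /* the remaining 0..3 elements */
          s += x[i];
      return s;
  }

  int main(void)
  {
      double x[] = { 1, 2, 3, 4, 5, 6, 7 };
      printf("%g\n", sum(x, 7));   /* prints 28 */
      return 0;
  }
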

Logic Rules

These rules deal with efficiency problems that arise when evaluating the program state by making various kinds of tests. They sacrifice clarity and robustness for speed of execution.
Exploit Algebraic Identities:
If the evaluation of a logical expression is costly, replace it by an algebraically equivalent expression that is cheaper to evaluate.
Short-Circuiting Monotone Functions:
If we wish to test whether some monotone nondecreasing function of several variables is over a certain threshold, then we need not evaluate any of the remaining variables once the threshold has been reached. Short-circuit evaluation of boolean expressions is an example of this rule. (A more complex application of this rule exits from a loop as soon as the purpose of the loop has been accomplished; a sketch appears after this list.)
Reordering Tests:
Logical tests should be arranged such that inexpensive and often successful tests precede expensive and rarely successful tests.
Precompute Logical Functions:
A logical function over a small finite domain can be replaced by a lookup in a table that represents the domain. (A table lookup of this kind is sketched after this list.)
Boolean Variable Elimination:
We can remove boolean variables from a program by replacing the assignment to a boolean variable V by an if-then-else statement in which one branch represents the case that V is true and the other represents the case that V is false. (This generalizes to case statements and other logical control structures.) This rule usually decreases time slightly (say, less than 25 percent), but greatly increases code space. More complex applications of this rule remove boolean variables from data structures by keeping separate structures for the true and false records.
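Short-Circuiting Monotone Functions might be sketched in C as follows (the sum-of-distances test is illustrative). The partial sum of nonnegative values never decreases, so the loop can stop and report its answer as soon as the threshold is crossed.

  #include <stdio.h>

  /* Does the sum of the (nonnegative) values exceed the threshold?
     The partial sum is monotone nondecreasing, so we may stop early. */
  static int exceeds(const double dist[], int n, double threshold)
  {
      double sum = 0.0;
      for (int i = 0; i < n; i++) {
          sum += dist[i];
          if (sum > threshold)
              return 1;          /* short circuit: the answer is known */
      }
      return 0;
  }

  int main(void)
  {
      double d[] = { 3.0, 4.0, 5.0, 6.0 };
      printf("%d\n", exceeds(d, 4, 10.0));   /* prints 1 after three terms */
      return 0;
  }
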
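Precompute Logical Functions might look like this C sketch (the identifier-character predicate is an illustrative choice). A classification that would otherwise need a chain of range comparisons is built once into a 256-entry table; each later test is a single indexed load.

  #include <stdio.h>

  static unsigned char is_ident_char[256];   /* 1 if the byte may appear in an identifier */

  /* Build the table once; every later test is one lookup. */
  static void init_ident_table(void)
  {
      for (int c = 0; c < 256; c++)
          is_ident_char[c] = (c >= 'a' && c <= 'z') ||
                             (c >= 'A' && c <= 'Z') ||
                             (c >= '0' && c <= '9') ||
                             c == '_';
  }

  int main(void)
  {
      init_ident_table();
      printf("%d %d\n", is_ident_char['x'], is_ident_char['+']);   /* prints 1 0 */
      return 0;
  }
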

Procedure Rules

Procedures form good abstraction boundaries, but a given boundary may have an unacceptable execution cost.
Collapsing Procedure Hierarchies:
The run times of the elements of a set of procedures that (nonrecursively) call one another can often be reduced by rewriting the procedure bodies in line and binding the passed variables.
Exploit Common Cases:
Procedures should be organized to handle all cases correctly and common cases efficiently.
Coroutines:
A multiple-pass algorithm can often be turned into a single-pass algorithm by use of coroutines.
Transformations on Recursive Procedures:
The run time of recursive procedures can often be reduced by transformations such as removing tail recursion, converting the recursion to iteration with an explicit stack, and solving small subproblems directly rather than recursing on them. (Tail-recursion removal is sketched after this list.)
Parallelism:
A program should be structured to exploit as much of the parallelism as possible in the underlying hardware.
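The tail-recursion removal mentioned above, under Transformations on Recursive Procedures, might look like the following C sketch (the greatest-common-divisor routine is an illustrative choice, not the book's example). Because the recursive call is the procedure's last action, it can be replaced by a loop that reassigns the parameters, eliminating the call overhead and the stack growth.

  #include <stdio.h>

  /* Recursive form: the recursive call is the last action (a tail call). */
  static int gcd_rec(int a, int b)
  {
      return b == 0 ? a : gcd_rec(b, a % b);
  }

  /* Tail recursion removed: the call becomes a loop that reassigns
     the parameters, so no new stack frame is needed per step. */
  static int gcd_iter(int a, int b)
  {
      while (b != 0) {
          int t = a % b;
          a = b;
          b = t;
      }
      return a;
  }

  int main(void)
  {
      printf("%d %d\n", gcd_rec(1071, 462), gcd_iter(1071, 462));   /* prints 21 21 */
      return 0;
  }
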

Expression Rules

Be careful that hand application of these rules does not work against the optimizations your compiler already performs.
Compile-Time Initialization:
As many variables as possible should be initialized before program execution.
Exploit Algebraic Identities:
If the evaluation of an expression is costly, replace it by an algebraically equivalent expression that is cheaper to evaluate.
Common Subexpression Elimination:
If the same expression is evaluated twice with none of its variables altered between evaluations, then the second evaluation can be avoided by storing the result of the first and using that in place of the second. (We cannot eliminate the common evaluation of an expression with important side-effects.)
Pairing Computation:
If two similar expressions are frequently evaluated together, then we should make a new procedure that evaluates them as a pair. Examples include sine/cosine and minimum/maximum.
Exploit Word Parallelism:
Use the full word width of the underlying computer architecture to evaluate expensive expressions. The bitwise representation of sets exploits this rule.
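That representation might be sketched in C as follows (the set size and names are illustrative). A set of small integers is packed into 64-bit words, so a union examines 64 potential members with a single bitwise OR.

  #include <stdint.h>
  #include <stdio.h>

  /* A set over {0, ..., 255} stored as four 64-bit words. */
  #define WORDS 4
  typedef struct { uint64_t bits[WORDS]; } set256;

  static void set_add(set256 *s, int e)
  {
      s->bits[e >> 6] |= (uint64_t)1 << (e & 63);
  }

  static int set_member(const set256 *s, int e)
  {
      return (int)((s->bits[e >> 6] >> (e & 63)) & 1);
  }

  static set256 set_union(set256 a, set256 b)
  {
      set256 r;
      for (int i = 0; i < WORDS; i++)
          r.bits[i] = a.bits[i] | b.bits[i];   /* 64 elements per OR */
      return r;
  }

  int main(void)
  {
      set256 a = {{0}}, b = {{0}};
      set_add(&a, 3);
      set_add(&b, 200);
      set256 u = set_union(a, b);
      printf("%d %d %d\n", set_member(&u, 3), set_member(&u, 200),
             set_member(&u, 7));               /* prints 1 1 0 */
      return 0;
  }
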

System Dependent Optimization

Some optimizations require an understanding of their effect on the underlying system.
The High-Level Language:
What are the relative costs of the various constructs in the language? What are the cheapest possible ways to accomplish common operations (e.g., input and output, or iterating through the elements of a vector)?
The Compiler:
What optimizations does the compiler perform, and which should the programmer perform by hand? Which language constructs encourage the compiler to perform optimizations, and which constructs block optimizations? Current optimizing compilers can move code out of loops, unroll loops, perform transfer-driven loop unrolling, remove unconditional branches, exploit algebraic identities, short-circuit simple monotone functions, collapse procedure hierarchies, eliminate tail recursion, and eliminate common subexpressions.
The Machine Architecture:
Which efficient operations in the underlying hardware are easy to access, and how can the compiler give access to them? Is there an instruction buffering scheme that makes certain instruction organizations more efficient than others? What paging or caching schemes does the machine support, and how can the program's locality structure be organized to exploit the schemes? Issues include the following.

Instruction Costs:
Different instructions may have different relative costs for different implementations of an architecture.
Register Allocation:
Storing a variable in a register increases a program's efficiency in two ways. The running time is usually decreased because registers are usually realized by a faster technology than main memory and are architecturally (and usually physically) closer to the arithmetic and logical unit. A variable in a register also requires fewer bits to address; this means that the instruction is faster to access and that the overall program will require less space.
Instruction Caching:
Many high-speed computers have a cache for the recently accessed instructions; if a tight loop fits in that cache then it will run more quickly than if the instructions must be fetched repeatedly from main memory. Unrolling a loop such that it exceeds the size of the cache may reduce performance.
Storage Layout:
The layout of a program's storage in memory can have a significant effect on the execution speed of the program. Making the records in an array a power of two in length lets the array index be scaled by a left shift, which is often cheaper than multiplication by an arbitrary integer. However, the number of banks in an interleaved memory is often a power of two, and such power-of-two strides can interfere badly with this approach.
Virtual Memory Caching:
Knowledge of the interaction of the paging structure and the storage layout can have a profound effect on efficiency. Traverse row-major matrices by row rather than by column.
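A C sketch of that advice (the matrix size is illustrative): C stores arrays by row, so the first loop below touches memory sequentially, while the second jumps a full row ahead on every step and touches far more cache lines and pages.

  #include <stdio.h>

  #define N 1024
  static double a[N][N];

  int main(void)
  {
      double sum = 0.0;

      /* Row by row: consecutive iterations touch adjacent addresses. */
      for (int i = 0; i < N; i++)
          for (int j = 0; j < N; j++)
              sum += a[i][j];

      /* Column by column: each step jumps N * sizeof(double) bytes,
         defeating both the cache and the paging system. */
      for (int j = 0; j < N; j++)
          for (int i = 0; i < N; i++)
              sum += a[i][j];

      printf("%g\n", sum);
      return 0;
  }
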

Lawrence A. Crowl, crowl@cs.orst.edu, 14 September 1995