Chapter 4: The Crash of the AT&T Network in 1990



The Root Problem

The cause of the problem had come months before. In early December, technicians had upgraded the software to speed processing of certain types of messages. Although the upgraded code had been rigorously tested, a one-line bug was inadvertantly added to the recovery software of each of the 114 switches in the network. The defect was a C program that featured a break statement located within an if clause, that was nested within a switch clause.
In pseudocode, the program read as follows:



1  while (ring receive buffer not empty 
          and side buffer not empty) DO

2    Initialize pointer to first message in side buffer
     or ring receive buffer

3    get copy of buffer

4    switch (message)

5       case (incoming_message):

6             if (sending switch is out of service) DO

7                 if (ring write buffer is empty) DO

8                     send "in service" to status map

9                 else

10                    break

                  END IF

11           process incoming message, set up pointers to
             optional parameters

12           break
       END SWITCH


13   do optional parameter work