“This is just a standard management technique that has been used by personnel supervisors since the days of.. Ho Lu, grand emperor of the Wu dynasty.”

This is assignment 1 due during week 2, and it is related to this post for school. Download it from here in order to print it out.

In the early 1990s, a piece of telephony software written by DSC Communications Corp. caused phone outages across the United States. The root cause was traced back to a typing mistake: the bug occurred in three lines of a million-line program. At the time, Frank Perpiglia, Vice President for technology and product development at DSC, said that the intended change to the software was very minor, and that it had been determined that the usual rounds of testing were not needed. In hindsight, that was a bad idea.

Yet to put the blame on insufficient testing alone is to ignore the other factors that contributed to the software bug. In fact, I would say that the problem really resides in the policies and procedures that allowed the developers to release their software with less than adequate testing. A symptom of this lack of quality-focused policies can be found in the cavalier comment from Perpiglia, “We had a small modification to make a small change. We felt that the change itself did not require three to four months of testing.” It was sheer folly to accept the risk of not testing the change.

First, let’s focus on DSC’s software configuration management policy, if they had one. A change was going to be made to their software. Was there a need for the change? Was it meant to fix a bug, or was it something relatively minor? From Perpiglia’s statement, it seems that DSC deemed the fix to be a minor, innocuous change. It probably was also somewhat necessary. I can’t believe that a simple patch of only a few lines of code would be released on a whim. From this viewpoint, did software configuration management accept that the change was necessary, and at whose discretion was it approved?

Secondly, let’s take a look at this issue from the software developers’ position in relation to their processes and procedures. Making the change requires at least a minimal set of informal tests. Was there a procedure for code or peer review? Even in an informal setting with several peers and domain experts, the flaw might have been spotted. The problem literally was a typo. At the least, was there a procedure to conduct unit tests on the change? Perhaps a unit test could’ve caught the flaw early, exposing the unit under test as not performing correctly. A set of procedures for the developers would’ve been helpful in spotting the problem.
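
To make the unit-test point concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the function names, the queue capacity, and the nature of the typo are invented for illustration and are not taken from the DSC code. It only shows how one extra character in a comparison can survive a casual read yet fail a simple test at the boundary.

```python
# Hypothetical example: none of these names or values come from the DSC software.
MAX_QUEUE = 16  # assumed capacity of a message queue


def has_room(queued: int) -> bool:
    """Intended behavior: accept a message only while the queue is below capacity."""
    return queued < MAX_QUEUE


def has_room_typo(queued: int) -> bool:
    """Same function with one extra character typed: '<=' instead of '<'."""
    return queued <= MAX_QUEUE


def boundary_test(fn) -> bool:
    """A tiny unit test: check behavior just below and exactly at capacity."""
    return fn(MAX_QUEUE - 1) is True and fn(MAX_QUEUE) is False


print(boundary_test(has_room))       # the intended version passes the check
print(boundary_test(has_room_typo))  # the typo version fails at the boundary
```

A check like this takes minutes to write and run, which is the contrast with the three to four months of full testing that Perpiglia mentions.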

Finally, focus should fall on the system test group. The risk of not testing any of the software was accepted because running a full set of qualification tests was time consuming. Still, could they have developed a reduced set of tests to ensure that the change would not produce any bugs? They knew the functional area affected, so they could’ve run a smaller set of tests focused specifically on the few requirements that were affected. Did they have sufficient tests for that functional area, or was the bug the type where the code changes in one part of the program and a wholly unrelated piece fails? Sometimes a small set of tests will not uncover the bug, and the full complement of qualification tests is needed to discover it.

In summary, the problem with the DSC software could be traced back to problems with the policies and procedures that were in place. They were lacking because they allowed DSC to accept the risk on the assumption that an insignificant change to the software would not produce any anomalies. DSC neither planned for the change, reviewed the change, nor adequately tested the change. It was not just the lack of test time that did them in, but a lack of attention to policies and procedures that would’ve helped them focus on delivering a quality product.

Posted by broderic

One Reply to ““This is just a standard management technique that has been used by personnel supervisors since the days of.. Ho Lu, grand emperor of the Wu dynasty.””

  1. Talk about getting exercise by jumping to conclusions.

    The software you’re talking about was real-time telecom software. “Patches” [which were hexadecimal codes to change the software code dynamically, on the fly] were used to make dynamic changes to operations software. Knowing the risks that this entailed, several procedural steps were in place to have each “patch” checked by at least two other programmers before inserting a patch into the running system.

    What happened? The procedure wasn’t followed by the one person putting in the patch. Simple as that.

    Now dear writer, what “policy and procedure” do you propose to fix the problem that a programmer simply screwed up? Shoot the programmer? Fire him?

    In fact, what we did was to immediately reward the entire team [including the offending programmer] who quickly isolated the source of the problem. We did that to avoid the temptation to play the blame game, which the media was hyping. [And which, 15 years later, you seem to be doing also.]

    And several months later, the key Bell Companies affected by the outages visited DSC headquarters to give out awards to the people who made the mistake.

    Why did the affected customers do that? Because the people developing the SCP code were truly one of the finest groups of programmers around.

    Lastly, the outages created a huge political climate in DC, so anything you read about this topic is certain to be clouded by that environment.
    Be careful before drawing conclusions based on very limited accounts in the media.

    Frank Perpiglia
