Spreadsheet Inspection Experiments

Ray Panko's Spreadsheet Research Website

Synopsis

Humans are quite accurate when we work. When we write a line of program code, create a spreadsheet formula, or do other simple but nontrivial actions, we are correct 95% to 99% of the time.

Unfortunately, we are not as good at finding errors. When we look diligently for errors, we find only 40% to 90% of them, and finding 90% is rare. We would like to know how effective we are at finding errors when we inspect a spreadsheet. This is not just an "academic" concern. If our spreadsheets have as many errors as they seem to have, then we must test them more. One way to do this is to inspect them, cell by cell. Many people believe that if we are careful enough, and if we spend enough time, we can find all of the errors in a spreadsheet, program, or other human cognitive artifact. However, this ability is not observed in reality. We need to move beyond intuition and use solid data to understand how much inspection we will need to do on spreadsheets. If our individual detection rates are too low, then we will need to do team inspection in order to catch 80% to 90% of the errors.

This page presents information on spreadsheet inspection experiments that asked participants to detect errors in spreadsheet models. In these experiments, 1,025 participants found an average of 60% of the errors in the spreadsheets they examined.

This low detection rate suggests that team inspection will be needed. Several research findings suggest that applying code inspection techniques to spreadsheet inspection will be fruitful.

Other findings include the following:

In a nonrigorous "experiment," participants at a EuSpRIG conference inspected a spreadsheet to detect errors. The spreadsheet was based on the Wall task and included the most common errors made by participants during development. They detected 70% of the errors. This figure may be inflated because two groups of participants took the initiative (cheated) by working in teams rather than as individuals. No individual or team detected all of the errors. At the end, the moderator said that he was sure they could have found all of the errors given more time.

Summary Results

Table 1 presents the key results from these experiments. The key take-away is that, across the nine studies summarized in the table, participants found an average of 60% of the errors. Although not all studies reported the distribution of error detection success rates, it appears that few if any of the 1,025 subjects found all of the errors. In the Panko (1999) study, for example, none did.

Table 1: Spreadsheet Error Detection Experiments: Summary

Study | Sample Size | Percentage Detected or Corrected | Seeded Errors?
Howe and Simkin, 2006 | 228 | 67% | Yes
Panko, 1999 | 60 | 63% | Yes
McKeever & McDaid, 2010 | 29 | 63% | Yes
Teo and Lee-Partridge, 2001 | 382 | 62% | Yes
Bishop & McDaid, 2007 | 45 | 62% | Yes
Galletta et al., 1993 | 60 | 56% | Yes
Aurigemma and Panko, 2010 | 43 | 56% | No
Galletta et al., 1997 | 113 | 51% | Yes
Anderson, 2004 | 65 | 37% | Yes
Overall | 1,025 | 60% |

One issue in these studies is that most used seeded errors, in which the experimenter created the errors in the spreadsheet. This is dangerous, because seeding might result in an atypical spectrum of error types. If error detection rates vary with the type of error, as they do, then seeding might produce anomalous results. Only one study used errors that were actually produced by real developers (Aurigemma and Panko, 2010). Each participant inspected one spreadsheet developed by a participant in a previous experiment that used different subjects. Collectively, the participants discovered 56% of the errors in the spreadsheets. This is consistent with the results from other studies, so the seeded-error problem does not appear to be severe, at least for relatively simple spreadsheets.

Error Detection in the Anderson Experiment

The study Anderson (2004) conducted was based on the task used by Galletta et al. (1997) but added errors in more difficult formulas, for example in COUNTIF calculations and in a VLOOKUP operation with IF conditions. Clearly, detection rates were lower for this more difficult material. Interestingly, although Anderson's 65 working professionals averaged 6.8 years of experience with spreadsheets and spent an average of 5.3 hours per week using them, they had lower detection rates for the tasks taken directly from the Galletta et al. study than did the participants in that study, including those with very little spreadsheet development experience. Inspection is a very tiring task, and the Anderson participants may have been fatigued. Limiting module length and inspection time has long been a hallmark of software code inspection (Cohen 2006; Fagan 1976, 1986; Gilb and Graham 1993).

Table 2: Error Detection in the Anderson (2004) Experiment

Error | Detection Rate
Omits a variable | 58%
Range is incorrect | 57%
Sum counts additional cells | 57%
Operator precedence error | 57%
Typed 25 instead of C25 | 55%
Countif error | 51%
Formula adds a wrong cell | 42%
Formula adds instead of multiplies | 38%
Typographical error in data | 32%
Sums values when it should not | 17%
Value is stored as text | 15%
Countif error | 8%
Vlookup error with IF conditions | 6%
Y2K problem | 2%
Overall | 37%

Source: Anderson (2004)

Group Code Inspection

Software code inspection is done by teams rather than individuals because individual error detection rates are too low. To mirror Fagan software inspection, undergraduate MIS majors were given a spreadsheet with seeded errors plus the requirements for building the spreadsheet (Panko 1999). They inspected the spreadsheet individually for 45 minutes, then met in groups of three to record the errors found and agree on whether each was truly an error. While participants working alone found 63% of the errors on average, groups of three averaged 83%. This may seem like a modest gain for tripling the cost of inspection, but Table 3 shows that there were much larger gains for difficult-to-detect errors.

Table 3: Error Detection Rates for Participants Working Alone and in Groups of Three

Error Type | Working Alone | Working in Triads
All | 63% | 83%
Formula | 73% | 93%
Omission | 33% | 55%
Short formulas | 84% | 94%
Long formulas | 53% | 84%

Source: Panko (1999)

Detailed Results

Table 4 gives selected detailed results from spreadsheet inspection experiments.

Table 4: Detailed Results from Experiments to Detect Errors in Spreadsheets

Study | Description | Sample | % of Errors Corrected / Detected | Remarks
Galletta et al., 1993 | Six spreadsheets, each with one device (mechanical) and one domain (logic) error. | 30 MBA students and 30 CPAs taking a continuing education course | Detected Errors | CPAs found significantly more errors than MBAs; spreadsheet experience made no difference in detection rate, although experienced spreadsheet users worked faster. All differences were small.
 | Total sample | 60 | 56% |
 | CPA novices (<100 hrs.) | 15 | 57% |
 | CPA experts (>250 hrs.) | 15 | 66% |
 | MBA students, novices | 15 | 52% |
 | MBA students, experts | 15 | 48% |
 | Domain (logic) errors | | 46% |
 | Device (mechanical) errors | | 51% |
Galletta et al., 1997 | Student budgeting task on a single worksheet. | 113 MBA students | Detected Errors |
 | Overall | | 51% |
 | On screen | | 45% |
 | On paper | | 55% |
Panko and Sprague, 1998 | Students developed spreadsheets from the Wall task. After they had built their models, they inspected their own models. | 18 subjects with initial errors | Corrected Errors | Only for subjects with incorrect spreadsheets. Other subjects did not change their spreadsheets.
 | | | 18% |
Panko, 1999 | Modified version of Galletta et al., 1997. | 60 undergraduate MIS students in an IT class. Worked alone and then in groups of 3 to develop a joint list of errors. | Detected Errors | Followed a modification of the Fagan (1976) approach. Compared detection for omission versus formula errors and for long versus short formulas.
 | Alone / groups of 3 | | 63% / 83% | p = 0.009
 | Alone: formula / omission | | 73% / 33% | p < 0.001
 | Group: formula / omission | | 93% / 55% | p < 0.001
 | Alone: short / long formula | | 84% / 53% | p < 0.001
 | Group: short / long formula | | 99% / 84% | p < 0.001
Teo and Lee-Partridge, 2001 | Two variants of the Panko Wall task. Students did one on paper and the other online. We consider only the first task, in which the student did UNCLE or DAD on paper. | 382 undergraduate students in a skills class. 17 different sections were assigned the order of UNCLE and DAD randomly. | Detected Errors |
 | Overall quantitative | | 62% |
 | Mechanical | | 83% |
 | Logic | | 60% |
 | Omission | | 43% |
Anderson, 2004 | Based on Galletta et al., 1997, with added errors for a total of 15 errors. Web analog of a spreadsheet. 45 min. allowed. | 65 professionals. Average age 37.0. Mean spreadsheet usage 6.8 years and 5.3 hr/week. | Detected Errors | On the 8 errors taken from Galletta's task, Galletta's participants found an average of 4.2; Anderson's subjects found 3.5.
 | | | 37% |
Howe and Simkin, 2006 | Choi Construction company data and projections (4 worksheets). | 228 students in IT classes at two schools | Corrected Errors |
 | All errors [43 errors] | | 67% |
 | Data entry [5] | | 72% |
 | Clerical & non-material [10] | | 66% |
 | Rules violation [3] | | 60% |
 | Formula [25] | | 54% |
Bishop & McDaid, 2007 | Modification of the Howe and Simkin, 2006 task. | 13 industry-based professionals, 34 accounting and finance students | Corrected Errors | There was a moderate to strong correlation between error-correction success and whether a cell was edited or the person spent >0.3 seconds in a cell.
 | All errors [42] | All / Prof / Stud | 62% / 72% / 58% |
 | Data entry [8] | " | 64% / 68% / 63% |
 | Clerical & non-material [4] | " | 13% / 17% / 11% |
 | Rules violation [4] | " | 71% / 85% / 65% |
 | Formula [26] | " | 67% / 79% / 63% |
McKeever and McDaid, 2010 | Task used in Bishop & McDaid (2007), with range names in the experimental group. | Computer students. 14 had spreadsheets with range names; 15 did not. | Corrected Errors | In contrast with a study presented the previous year, this study used only a few range names to avoid overloading subjects.
 | All errors [38 errors] | All / Range names / None | 63% / 59% / 66% | p = 0.05
 | Data entry [8] | " | 65% / 63% / 66% | p = 0.29
 | Clerical & non-material [4] | " | 26% / 23% / 28% | p = 0.26
 | Rules violation [4] | " | 66% / 71% / 60% | p = 0.81
 | Formulas total | " | 68% / 63% / 74% | No hypothesis
 | Formula logic [6] | " | 78% / 73% / 82% | p = 0.14
 | Formula reference [7] | " | 75% / 66% / 84% | p = 0.12
 | Formula name [9] | " | 56% / 51% / 60% | p = 0.16
Aurigemma and Panko, 2010 | The Microslo task required subjects in a previous experiment to develop a pro forma income statement from a data statement. | In an MIS course, 43 subjects each looked for errors in one spreadsheet developed by a previous subject building a Microslo task model. | | Only inspection study in which errors were not seeded by the experimenter but were made by previous subjects building the spreadsheet. 84% of the errors were logical mistakes; of the remaining slips and lapses, only 38% were slips such as typographical errors and pointing errors. Most seeded spreadsheets used primarily execution errors, especially slips.
 | All errors | | 56% |
 | Mistakes (logical) | | 48% |
 | Execution (slips, lapses) | | 63% | Detection-rate differences between mistakes and slips/lapses are not significant.

Opportunities for Research

Given similarities between software and spreadsheet development error rates and error detection rates, the problem of widespread errors in spreadsheets is probably a case of inadequate testing. Spreadsheet development is easy to learn, although difficult to master. Spreadsheet testing is just plain hard. For spreadsheet development to be worthy of the name professional, it must include a large amount of testing, perhaps a quarter or more of all development resources.

Given our blindness to most of the errors we make, our intuition about how to reduce errors should be treated as highly suspect. Intuitions about how to improve error detection should be suspect for the same reason. We should test every prescription for safety and effectiveness.

We need to determine whether tools that aid us in testing will really improve things. For instance, one possibility is to double-click on each formula in turn and check its precedents. Will this reduce errors, and is it better to check formulas as we develop a spreadsheet module or after we are satisfied with the module and view it as finished? Or should we do both?
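
As a rough illustration of this kind of tool support, the sketch below uses Python with the openpyxl library to walk every formula in a workbook and list the cell references (precedents) each one uses, so an inspector can tick them off one by one. The workbook name and the simple regular expression for references are illustrative assumptions, not features of any study discussed here.

```python
# Sketch: list each formula's precedent cells so an inspector can check them one by one.
# Assumes an illustrative workbook named "budget.xlsx"; openpyxl reads the stored
# formula text but does not evaluate it.
import re
from openpyxl import load_workbook

# Matches A1-style references and ranges such as B4 or A1:C9 (a deliberately simple pattern).
CELL_REF = re.compile(r"\$?[A-Z]{1,3}\$?\d+(?::\$?[A-Z]{1,3}\$?\d+)?")

wb = load_workbook("budget.xlsx", data_only=False)   # keep formulas, not cached values
for ws in wb.worksheets:
    for row in ws.iter_rows():
        for cell in row:
            if cell.data_type == "f":                 # "f" marks a formula cell
                refs = CELL_REF.findall(cell.value)
                print(f"{ws.title}!{cell.coordinate}: {cell.value}")
                print("   precedents:", ", ".join(refs) or "(none found)")
```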

One tool that is often touted is the static analysis program (SAP), which flags specific patterns in spreadsheets (called "smells" in software engineering static analysis). These programs also help visualize the logical structure of spreadsheets. Experiments have shown that SAP flagging functionality worked fairly well on seeded errors. However, when Aurigemma and Panko (2010) applied two SAPs to errors created by users, the flagging tools worked poorly.
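
To make the idea of flagging concrete, here is a minimal Python/openpyxl sketch of a single toy rule: it flags numeric constants hard-coded inside formulas, a pattern commonly treated as a smell. Real static analysis programs apply many more rules; the rule, the regular expression, and the file name here are assumptions for illustration only.

```python
# Sketch: a toy static check that flags one common spreadsheet "smell":
# numeric constants hard-coded inside formulas (e.g., =B4*1.08 instead of
# referencing an assumptions cell). Expect false positives, e.g., numeric
# function arguments such as ROUND(B2, 2).
import re
from openpyxl import load_workbook

# Digits not preceded by a letter, "$", ".", or another digit, so cell references are skipped.
HARDCODED_NUMBER = re.compile(r"(?<![A-Z$.\d])\d+(\.\d+)?")

def flag_hardcoded_constants(path):
    wb = load_workbook(path, data_only=False)
    findings = []
    for ws in wb.worksheets:
        for row in ws.iter_rows():
            for cell in row:
                if cell.data_type == "f" and HARDCODED_NUMBER.search(cell.value):
                    findings.append((f"{ws.title}!{cell.coordinate}", cell.value))
    return findings

if __name__ == "__main__":
    for location, formula in flag_hardcoded_constants("budget.xlsx"):
        print(f"{location}: possible hard-coded constant in {formula}")
```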

In software engineering, Fagan inspection (Fagan 1976, 1986) has proven to be a safe and effective way to inspect program code for errors. Hard experience has led to a multi-stage inspection process, the requirement for multiple inspectors, and limitations on module length and inspection times, among other things. Will these apply to spreadsheet inspection?
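
To make the idea of limiting module length concrete, here is a small Python sketch that divides a workbook's formula cells into inspection sessions of bounded size. The cap of 100 formulas per session and the file name are illustrative assumptions, not published guidelines.

```python
# Sketch: split a workbook's formula cells into inspection sessions of bounded size,
# echoing the software-inspection practice of capping how much material is reviewed
# in one sitting.
from openpyxl import load_workbook

def plan_sessions(path, formulas_per_session=100):
    wb = load_workbook(path, data_only=False)
    cells = [f"{ws.title}!{c.coordinate}"
             for ws in wb.worksheets
             for row in ws.iter_rows()
             for c in row
             if c.data_type == "f"]                     # formula cells only
    # Chunk the formula cells so no single session exceeds the cap.
    return [cells[i:i + formulas_per_session]
            for i in range(0, len(cells), formulas_per_session)]

for n, session in enumerate(plan_sessions("budget.xlsx"), start=1):
    print(f"Session {n}: {len(session)} formulas, {session[0]} through {session[-1]}")
```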

In dynamic testing, the tester creates test cases with expected results (oracles) and runs them against a program module. When the cases are well-defined, their execution can be automated. This permits regression testing, in which all test cases are run after every change. This is an area that has not been explored in spreadsheets. It can be implemented in either VBA within Excel or externally via Visual Studio or similar tools. Again, software engineers in test (the professional name for testers) have developed guidelines for how to develop test cases based on equivalence classes and other practices.
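
As a sketch of what such external, automatable regression testing might look like, the example below drives Excel through the xlwings library: it writes test inputs into assumed input cells, lets Excel recalculate, and compares assumed output cells against expected (oracle) values. The workbook name, cell addresses, and expected values are invented for illustration, and the approach requires a local Excel installation.

```python
# Sketch: a tiny regression-test harness for a spreadsheet, driven through Excel
# with the xlwings library. The workbook name, addresses, and oracles are
# illustrative assumptions, not taken from any study cited above.
import xlwings as xw

# Each test case: inputs written to cells, plus the expected (oracle) outputs.
TEST_CASES = [
    ({"Inputs!B2": 1000, "Inputs!B3": 0.05}, {"Summary!C10": 1050.0}),
    ({"Inputs!B2": 0,    "Inputs!B3": 0.05}, {"Summary!C10": 0.0}),   # boundary case
]

def run_regression(path="budget.xlsx", tolerance=1e-6):
    app = xw.App(visible=False)
    try:
        wb = app.books.open(path)
        failures = 0
        for inputs, expected in TEST_CASES:
            for address, value in inputs.items():
                sheet, cell = address.split("!")
                wb.sheets[sheet].range(cell).value = value
            for address, want in expected.items():
                sheet, cell = address.split("!")
                got = wb.sheets[sheet].range(cell).value   # Excel recalculates automatically
                if got is None or abs(got - want) > tolerance:
                    failures += 1
                    print(f"FAIL {address}: expected {want}, got {got}")
        wb.close()
        print(f"{len(TEST_CASES)} cases run, {failures} failure(s)")
    finally:
        app.quit()

if __name__ == "__main__":
    run_regression()
```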

Whatever type of testing will be done, there will be a need for an error tracking system that will facilitate follow-up on error reports and fixes.
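
As a minimal sketch of what such a tracking system would record, the Python data class below captures one inspection finding and its follow-up status; the field names and status values are illustrative assumptions.

```python
# Sketch: a minimal error-tracking record for inspection findings, so that every
# reported error can be followed up until it is fixed and re-verified.
from dataclasses import dataclass, field
from datetime import date

@dataclass
class ErrorReport:
    workbook: str                 # e.g., "budget.xlsx"
    cell: str                     # e.g., "Summary!C10"
    description: str              # what the inspector believes is wrong
    reported_by: str
    reported_on: date = field(default_factory=date.today)
    status: str = "open"          # open -> fixed -> verified (or rejected)

    def mark_fixed(self):
        self.status = "fixed"

    def mark_verified(self):
        self.status = "verified"

# Example: log a finding, then record its fix and re-check.
report = ErrorReport("budget.xlsx", "Summary!C10", "SUM range omits row 9", "R. Inspector")
report.mark_fixed()
report.mark_verified()
print(report)
```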

References

Allwood, C. M. (1984) Error detection processes in statistical problem solving, Cognitive Science. 8(4), 413-437.

Aurigemma, S. A. & Panko, R. R. (2010, July 9-11) The detection of human spreadsheet errors by humans versus inspection (auditing) software, EuSpRIG 2010 Proceedings, Greenwich, UK, Sponsored by the European Spreadsheet Risks Interest Group (EuSpRIG).

Ayalew, Y., Clermont, M., Mittermeir, R., & Klagenfurt, A. (2000). Detecting errors in spreadsheets. Spreadsheet Risks, Audit and Development Methods, 1, 51-62.

Bishop, B., & McDaid, K. (2007). An empirical study of end-user behaviour in spreadsheet error detection & correction, Proceedings of the European Spreadsheet Risk Interest Group Conference.

Butler, R. J. (2000). Is this spreadsheet a tax evader? How HM Customs and Excise test spreadsheet applications, Hawaii International Conference on System Sciences. Published by the IEEE Computer Society.

Clermont, M., & Mittermeir, R. (2003). Auditing large spreadsheet programs, Proceedings of International Conference on Information Systems Implementation and Modelling (pp. 87-97).

Clermont, M., Hanin, C., & Mittermeir, R. (2002). A spreadsheet auditing tool evaluated in an industrial context, Spreadsheet Risks, Audit and Development Methods, 3, 35-46.

Cragg, P. G., & King, M. (1993). Spreadsheet modelling abuse: An opportunity for OR? Journal of the Operational Research Society, 44(8), 743-752.

Croll, G.J. (2003). A typical model audit approach: Spreadsheet audit methodologies in the City of London. IFIP, Integrity and Internal Control in Information Systems, 124, pp. 213-219, Kluwer, 2003, http://aps.arxiv.org/abs/0712.2591.

Davies, N., & C. Ikin (1987, December). Auditing spreadsheets, Australian Accountant. 54-56.

Dent, A. (1995, April 2). Personal communication with the author via electronic mail.

Fagan, M. E. (1976). Design and code inspections to reduce errors in program development, IBM Systems Journal, 15(3), 182-211.

Fagan, M. E. (1986, July). Advances in software inspections, IEEE Transactions on Software Engineering, SE-12(7), 744-751.

Galletta, D. F., Abraham, D., El Louadi, M., Lekse, W., Pollalis, Y. A., & Sampler, J. L. (1993). An empirical study of spreadsheet error-finding performance, Accounting, Management and Information Technologies, 3(2), 79-95.

Galletta, D. F., Hartzel, K. S., Johnson, S. E., Joseph, J. L., & Rustagi, S. (1996). Spreadsheet presentation and error detection: An experimental study, Journal of Management Information Systems, 13(3), 63.

Gilb, T., & Graham, D. (1993). Software Inspection. Edinburgh Gate, England: Pearson Education Limited.

Hicks, L. (1995). NYNEX, personal communication with the first author via electronic mail.

Howe, H., & Simkin, M. G. (2006). Factors affecting the ability to detect spreadsheet errors, Decision Sciences Journal of Innovative Education, 4(1), 101-122.

Joseph, Jimmie L. (2002). The effect of group size on spreadsheet error debugging, unpublished doctoral dissertation, Katz Graduate School of Business, University of Pittsburgh, Pittsburgh, Pennsylvania.

Klein, B. D., Goodhue, D. L. & Davis, G. B. (1997). Can humans detect errors in data? Impact of base rates, incentives, and goals, MIS Quarterly, 21(2), 169-194.

KPMG Management Consulting (1998, July 30). Supporting the decision maker—A guide to the value of business modeling, press release.

MacKay, D. G. (1992). Errors, ambiguity, and awareness in language perception and production. In B. J. Baars (Ed.), Experimental Slips and Human Error: Exploring the Architecture of Volition (pp. 39-69). New York: Plenum.

Nixon, D., & O'Hara, M. (2001). Spreadsheet auditing software, Proceedings of the Second Symposium of the European Spreadsheet Risks Interest Group, Greenwich, UK.

Panko, R. R. (1999). Applying code inspection to spreadsheet testing. Journal of Management Information Systems, 16(2), 176.

Panko, R. R. (1988). End User Computing: Management, Applications, and Technology. New York: Wiley.

Panko, R. R., & Halverson, R. P., Jr. (1996, January). Spreadsheets on trial: A framework for research on spreadsheet risks, Proceedings of the Twenty-Ninth Hawaii International Conference on System Sciences, Volume II, Kihei, Maui, 326-335.

Panko, R. R., & Aurigemma, S. (2010). Revising the Panko-Halverson taxonomy of spreadsheet errors. Decision Support Systems, 49(2010), 235–244

Panko, R.R., & Halverson, R.P., Jr. (1997, Spring). Are two heads better than one? (At reducing errors in spreadsheet modeling), Office Systems Research Journal, 15(1), 21-32.

Panko, R. R., & Sprague, R. H., Jr. (1998, April). Hitting the wall: Errors in developing and code inspecting a “simple” spreadsheet model, Decision Support Systems, 22(4), 337-353.

Plous, S. (1993). The psychology of judgment and decision making, Philadelphia, Pennsylvania: Temple University Press.

Ricketts, J. A. (1990, March). Powers-of-ten information biases, MIS Quarterly, 14(1), 63-77.

Teo, T. S. H., & Lee-Partridge, J. E. (2001). Effects of error factors and prior incremental practice on spreadsheet error detection: an experimental study. Omega, 29(5), 445-456.

Teo, T. S. H., & Tan, M. (1997, January). Quantitative and qualitative errors in spreadsheet development, Proceedings of the Thirtieth Hawaii International Conference on System Sciences, Kihei, Hawaii.