Spreadsheet Inspection Experiments

Ray Panko's Spreadsheet Research Website

Synopsis

Humans are quite accurate when we work. When we write a line of program code, create a spreadsheet formula, or do other simple but nontrivial actions, we are correct 95% to 99% of the time.

Unfortunately, we are not as good at finding errors. When we look diligently for errors, we find only 40% to 90% of them, and finding 90% is rare. We would like to know how effective we are at finding errors when we inspect a spreadsheet. This is not just an "academic" concern. If our spreadsheets have as many errors as they seem to have, then we must test them more. One way to do this is to inspect them, cell by cell. Many people believe that if we are careful enough, and if we spend enough time, we can find all of the errors in a spreadsheet, program, or other human cognitive artifact. However, this ability is not observed in reality. We need to move beyond intuition and use solid data to understand how much inspection we will need to do on spreadsheets. If our individual detection rates are too low, then we will need to do team inspection in order to catch 80% to 90% of the errors.

This page presents information on spreadsheet inspection experiments that asked participants to detect errors in spreadsheet models. In these experiments, 1,025 participants found an average of 60% of the errors in the spreadsheets they examined.

This low detection rate suggests that team inspection will be needed. Several research findings suggest that applying code inspection techniques to spreadsheet inspection will be fruitful.

Other findings include the following:

In a nonrigorous "experiment," participants at a EuSpRIG conference inspected a spreadsheet to detect errors. The spreadsheet was based on the Wall task and included the most common errors made by participants during development. They detected 70% of the errors. This figure may be inflated because two groups of participants took the initiative (cheated) by working in teams rather than as individuals. No individual or team detected all of the errors. At the end, the moderator said that he was sure they could have found all of the errors given more time.

Summary Results

Table 1 presents the key results from these experiments. The key take-away is that, across the nine studies summarized in the table, participants found an average of 60% of the errors. Although not all studies reported the distribution of error detection success rates, it appears that few if any of the 1,025 subjects found all of the errors. In the Panko (1999) study, for example, none did.

Table 1: Spreadsheet Error Detection Experiments: Summary

Study | Sample Size | Percentage Detected or Corrected | Seeded Errors?
Howe and Simkin, 2006 | 228 | 67% | Yes
Panko, 1999 | 60 | 63% | Yes
McKeever & McDaid, 2010 | 29 | 63% | Yes
Teo and Lee-Partridge, 2001 | 382 | 62% | Yes
Bishop & McDaid, 2007 | 45 | 62% | Yes
Galletta et al., 1993 | 60 | 56% | Yes
Aurigemma and Panko, 2010 | 43 | 56% | No
Galletta et al., 1997 | 113 | 51% | Yes
Anderson, 2004 | 65 | 37% | Yes
Overall | 1,025 | 60% |

One issue in these studies is that most used seeded errors, in which the experimenter created the errors in the spreadsheet. This is dangerous, because seeding might result in an atypical spectrum of error types. If error detection rates vary with the type of error, as they do, then seeding might produce anomalous results. Only one study used errors that were actually produced by real developers (Aurigemma and Panko, 2010). Each participant inspected one spreadsheet developed by a participant in a previous experiment that used different subjects. Collectively, the participants discovered 56% of the errors in the spreadsheets. This is consistent with the results from other studies, so the seeded-error problem does not appear to be severe, at least for relatively simple spreadsheets.

Error Detection in the Anderson Experiment

The study Anderson (2004) conducted was based on the task used by Galletta et al. (1997) but added errors in more difficult formulas, for example in COUNTIF calculations and in a VLOOKUP operation with IF conditions. Clearly, detection rates were lower for this more difficult material. Interestingly, although Anderson's 65 working professionals averaged 6.8 years of experience with spreadsheets and spent an average of 5.3 hours per week using them, they had lower detection rates for the tasks taken directly from the Galletta et al. study than did the participants in that study, including those with very little spreadsheet development experience. Inspection is a very tiring task, and the Anderson participants may have been fatigued. Limiting module length and inspection time has long been a hallmark of software code inspection (Cohen 2006; Fagan 1976, 1986; Gilb and Graham 1993).

Table 2: Error Detection in the Anderson (2004) Experiment

Error | Detection Rate
Omits a variable | 58%
Range is incorrect | 57%
Sum counts additional cells | 57%
Operator precedence error | 57%
Typed 25 instead of C25 | 55%
Countif error | 51%
Formula adds a wrong cell | 42%
Formula adds instead of multiplies | 38%
Typographical error in data | 32%
Sums values when it should not | 17%
Value is stored as text | 15%
Countif error | 8%
Vlookup error with IF conditions | 6%
Y2K problem | 2%
Overall | 37%

Source: Anderson (2004)

Group Code Inspection

Software code inspection is done by teams rather than individuals because individual error detection rates are too low. To mirror Fagan software inspection, undergraduate MIS majors were given a spreadsheet with seeded errors plus the requirements for building the spreadsheet (Panko 1999). They inspected the spreadsheet individually for 45 minutes, then met in groups of three to record the errors found and agree on whether each was truly an error. While participants working alone found 63% of the errors on average, groups of three averaged 83%. This may seem like a modest gain for tripling the cost of inspection, but Table 3 shows that there were much larger gains for difficult-to-detect errors.

Table 3: Error Detection Rates for Participants Working Alone and in Groups of Three

Error Type | Working Alone | Working in Triads
All | 63% | 83%
Formula | 73% | 93%
Omission | 33% | 55%
Short formulas | 84% | 94%
Long formulas | 53% | 84%

Source: Panko (1999)

Detailed Results

Table 4 gives selected detailed results from spreadsheet inspection experiments.

Table 4: Detailed Results from Experiments to Detect Errors in Spreadsheets

Study | Description | Sample | % of Errors Corrected / Detected | Remarks
Galletta et al., 1993 | Six spreadsheets, each with one device (mechanical) and one domain (logic) error. | 30 MBA students and 30 CPAs taking a continuing education course | Detected Errors | CPAs found significantly more errors than MBAs; spreadsheet experience made no difference in detection rate, although experienced spreadsheet users worked faster. All differences were small.
 | Total sample | 60 | 56% |
 | CPA novices (<100 hrs.) | 15 | 57% |
 | CPA experts (>250 hrs.) | 15 | 66% |
 | MBA students, novices | 15 | 52% |
 | MBA students, experts | 15 | 48% |
 | Domain (logic) errors | | 46% |
 | Device (mechanical) errors | | 51% |
Galletta et al., 1997 | Student budgeting task on a single worksheet. | 113 MBA students | Detected Errors |
 | Overall | | 51% |
 | On screen | | 45% |
 | On paper | | 55% |
Panko and Sprague, 1998 | Students developed spreadsheets from the Wall task. After they had built their models, they inspected their own models. | 18 subjects with initial errors | Corrected Errors | Only for subjects with incorrect spreadsheets. Other subjects did not change their spreadsheets.
 | | | 18% |
Panko, 1999 | Modified version of Galletta et al., 1997. | 60 undergraduate MIS students in an IT class. Worked alone and then in groups of 3 to develop a joint list of errors. | Detected Errors | Followed a modification of the Fagan (1976) approach. Compared detection for omission versus formula errors and for long versus short formulas.
 | Alone / groups of 3 | | 63% / 83% | p = 0.009
 | Alone: formula / omission | | 73% / 33% | p < 0.001
 | Group: formula / omission | | 93% / 55% | p < 0.001
 | Alone: short / long formula | | 84% / 53% | p < 0.001
 | Group: short / long formula | | 99% / 84% | p < 0.001
Teo and Lee-Partridge, 2001 | Two variants of the Panko Wall task. Students did one on paper and the other online. We consider only the first task, in which the student did UNCLE or DAD on paper. | 382 undergraduate students in a skills class. 17 different sections were assigned the order of UNCLE and DAD randomly. | Detected Errors |
 | Overall quantitative | | 62% |
 | Mechanical | | 83% |
 | Logic | | 60% |
 | Omission | | 43% |
Anderson, 2004 | Based on Galletta et al., 1997, with added errors for a total of 15 errors. Web analog of a spreadsheet. 45 min. allowed. | 65 professionals. Average age 37.0. Mean spreadsheet usage 6.8 years and 5.3 hr/week. | Detected Errors | On the 8 errors taken from Galletta's task, Galletta's participants found an average of 4.2; Anderson's subjects found 3.5.
 | | | 37% |
Howe and Simkin, 2006 | Choi Construction company data and projections (4 worksheets). | 228 students in IT classes at two schools | Corrected Errors |
 | All errors [43 errors] | | 67% |
 | Data entry [5] | | 72% |
 | Clerical & non-material [10] | | 66% |
 | Rules violation [3] | | 60% |
 | Formula [25] | | 54% |
Bishop & McDaid, 2007 | Modification of the Howe and Simkin, 2006 task. | 13 industry-based professionals, 34 accounting and finance students | Corrected Errors | There was a moderate to strong correlation between error-correction success and whether a cell was edited or the person spent >0.3 seconds in a cell.
 | All errors [42] | All / Prof / Stud | 62% / 72% / 58% |
 | Data entry [8] | " | 64% / 68% / 63% |
 | Clerical & non-material [4] | " | 13% / 17% / 11% |
 | Rules violation [4] | " | 71% / 85% / 65% |
 | Formula [26] | " | 67% / 79% / 63% |
McKeever and McDaid, 2010 | Task used in Bishop & McDaid (2007), with range names in the experimental group. | Computer students. 14 had spreadsheets with range names; 15 did not. | Corrected Errors | In contrast with a study presented the previous year, this study used only a few range names to avoid overloading subjects.
 | All errors [38 errors] | All / Range names / None | 63% / 59% / 66% | p = 0.05
 | Data entry [8] | " | 65% / 63% / 66% | p = 0.29
 | Clerical & non-material [4] | " | 26% / 23% / 28% | p = 0.26
 | Rules violation [4] | " | 66% / 71% / 60% | p = 0.81
 | Formulas total | " | 68% / 63% / 74% | No hypothesis
 | Formula logic [6] | " | 78% / 73% / 82% | p = 0.14
 | Formula reference [7] | " | 75% / 66% / 84% | p = 0.12
 | Formula name [9] | " | 56% / 51% / 60% | p = 0.16
Aurigemma and Panko, 2010 | The Microslo task required subjects in a previous experiment to develop a pro forma income statement from a data statement. | In an MIS course, 43 subjects each looked for errors in one spreadsheet developed by a previous subject building a Microslo task model. | | Only inspection study in which errors were not seeded by the experimenter but were made by previous subjects building the spreadsheet. 84% of the errors were logical mistakes; of the remaining slips and lapses, only 38% were slips such as typographical errors and pointing errors. Most seeded spreadsheets used primarily execution errors, especially slips.
 | All errors | | 56% |
 | Mistakes (logical) | | 48% |
 | Execution (slips, lapses) | | 63% | Detection-rate differences between mistakes and slips/lapses are not significant.

Opportunities for Research

Given similarities between software and spreadsheet development error rates and error detection rates, the problem of widespread errors in spreadsheets is probably a case of inadequate testing. Spreadsheet development is easy to learn, although difficult to master. Spreadsheet testing is just plain hard. For spreadsheet development to be worthy of the name professional, it must include a large amount of testing, perhaps a quarter or more of all development resources.

Given our blindness to most of the errors we make, our intuition about how to reduce errors should be treated as highly suspect. Intuitions about how to improve error detection should be suspect for the same reason. We should test every prescription for safety and effectiveness.

We need to determine whether tools that aid us in testing will really improve things. For instance, one possibility is to double-click on each formula in turn and check its precedents. Will this reduce errors, and is it better to check formulas as we develop a spreadsheet module or after we are satisfied with the module and view it as finished? Or should we do both?
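
As a rough illustration of this kind of tool support, the sketch below uses Python with the openpyxl library to walk every formula in a workbook and list the cell references (precedents) each one uses, so an inspector can tick them off one by one. The workbook name and the simple regular expression for references are illustrative assumptions, not features of any study discussed here.

```python
# Sketch: list each formula's precedent cells so an inspector can check them one by one.
# Assumes an illustrative workbook named "budget.xlsx"; openpyxl reads the stored
# formula text but does not evaluate it.
import re
from openpyxl import load_workbook

# Matches A1-style references and ranges such as B4 or A1:C9 (a deliberately simple pattern).
CELL_REF = re.compile(r"\$?[A-Z]{1,3}\$?\d+(?::\$?[A-Z]{1,3}\$?\d+)?")

wb = load_workbook("budget.xlsx", data_only=False)   # keep formulas, not cached values
for ws in wb.worksheets:
    for row in ws.iter_rows():
        for cell in row:
            if cell.data_type == "f":                 # "f" marks a formula cell
                refs = CELL_REF.findall(cell.value)
                print(f"{ws.title}!{cell.coordinate}: {cell.value}")
                print("   precedents:", ", ".join(refs) or "(none found)")
```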

One tool that is often touted is the static analysis program (SAP), which flags specific patterns in spreadsheets (called "smells" in software engineering static analysis). These programs also help visualize the logical structure of spreadsheets. Experiments have shown that SAP flagging functionality worked fairly well on seeded errors. However, when Aurigemma and Panko (2010) applied two SAPs to errors created by users, the flagging tools worked poorly.
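
To make the idea of flagging concrete, here is a minimal Python/openpyxl sketch of a single toy rule: it flags numeric constants hard-coded inside formulas, a pattern commonly treated as a smell. Real static analysis programs apply many more rules; the rule, the regular expression, and the file name here are assumptions for illustration only.

```python
# Sketch: a toy static check that flags one common spreadsheet "smell":
# numeric constants hard-coded inside formulas (e.g., =B4*1.08 instead of
# referencing an assumptions cell). Expect false positives, e.g., numeric
# function arguments such as ROUND(B2, 2).
import re
from openpyxl import load_workbook

# Digits not preceded by a letter, "$", ".", or another digit, so cell references are skipped.
HARDCODED_NUMBER = re.compile(r"(?<![A-Z$.\d])\d+(\.\d+)?")

def flag_hardcoded_constants(path):
    wb = load_workbook(path, data_only=False)
    findings = []
    for ws in wb.worksheets:
        for row in ws.iter_rows():
            for cell in row:
                if cell.data_type == "f" and HARDCODED_NUMBER.search(cell.value):
                    findings.append((f"{ws.title}!{cell.coordinate}", cell.value))
    return findings

if __name__ == "__main__":
    for location, formula in flag_hardcoded_constants("budget.xlsx"):
        print(f"{location}: possible hard-coded constant in {formula}")
```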

In software engineering, Fagan inspection (Fagan 1976, 1986) has proven to be a safe and effective way to inspect program code for errors. Hard experience has led to a multi-stage inspection process, the requirement for multiple inspectors, and limitations on module length and inspection times, among other things. Will these apply to spreadsheet inspection?
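
To make the idea of limiting module length concrete, here is a small Python sketch that divides a workbook's formula cells into inspection sessions of bounded size. The cap of 100 formulas per session and the file name are illustrative assumptions, not published guidelines.

```python
# Sketch: split a workbook's formula cells into inspection sessions of bounded size,
# echoing the software-inspection practice of capping how much material is reviewed
# in one sitting.
from openpyxl import load_workbook

def plan_sessions(path, formulas_per_session=100):
    wb = load_workbook(path, data_only=False)
    cells = [f"{ws.title}!{c.coordinate}"
             for ws in wb.worksheets
             for row in ws.iter_rows()
             for c in row
             if c.data_type == "f"]                     # formula cells only
    # Chunk the formula cells so no single session exceeds the cap.
    return [cells[i:i + formulas_per_session]
            for i in range(0, len(cells), formulas_per_session)]

for n, session in enumerate(plan_sessions("budget.xlsx"), start=1):
    print(f"Session {n}: {len(session)} formulas, {session[0]} through {session[-1]}")
```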

In dynamic testing, the tester creates test cases with expected results (oracles) and runs them against a program module. When the cases are well-defined, their execution can be automated. This permits regression testing, in which all test cases are run after every change. This is an area that has not been explored in spreadsheets. It can be implemented in either VBA within Excel or externally via Visual Studio or similar tools. Again, software engineers in test (the professional name for testers) have developed guidelines for how to develop test cases based on equivalence classes and other practices.
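
As a sketch of what such external, automatable regression testing might look like, the example below drives Excel through the xlwings library: it writes test inputs into assumed input cells, lets Excel recalculate, and compares assumed output cells against expected (oracle) values. The workbook name, cell addresses, and expected values are invented for illustration, and the approach requires a local Excel installation.

```python
# Sketch: a tiny regression-test harness for a spreadsheet, driven through Excel
# with the xlwings library. The workbook name, addresses, and oracles are
# illustrative assumptions, not taken from any study cited above.
import xlwings as xw

# Each test case: inputs written to cells, plus the expected (oracle) outputs.
TEST_CASES = [
    ({"Inputs!B2": 1000, "Inputs!B3": 0.05}, {"Summary!C10": 1050.0}),
    ({"Inputs!B2": 0,    "Inputs!B3": 0.05}, {"Summary!C10": 0.0}),   # boundary case
]

def run_regression(path="budget.xlsx", tolerance=1e-6):
    app = xw.App(visible=False)
    try:
        wb = app.books.open(path)
        failures = 0
        for inputs, expected in TEST_CASES:
            for address, value in inputs.items():
                sheet, cell = address.split("!")
                wb.sheets[sheet].range(cell).value = value
            for address, want in expected.items():
                sheet, cell = address.split("!")
                got = wb.sheets[sheet].range(cell).value   # Excel recalculates automatically
                if got is None or abs(got - want) > tolerance:
                    failures += 1
                    print(f"FAIL {address}: expected {want}, got {got}")
        wb.close()
        print(f"{len(TEST_CASES)} cases run, {failures} failure(s)")
    finally:
        app.quit()

if __name__ == "__main__":
    run_regression()
```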

Whatever type of testing will be done, there will be a need for an error tracking system that will facilitate follow-up on error reports and fixes.
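
As a minimal sketch of what such a tracking system would record, the Python data class below captures one inspection finding and its follow-up status; the field names and status values are illustrative assumptions.

```python
# Sketch: a minimal error-tracking record for inspection findings, so that every
# reported error can be followed up until it is fixed and re-verified.
from dataclasses import dataclass, field
from datetime import date

@dataclass
class ErrorReport:
    workbook: str                 # e.g., "budget.xlsx"
    cell: str                     # e.g., "Summary!C10"
    description: str              # what the inspector believes is wrong
    reported_by: str
    reported_on: date = field(default_factory=date.today)
    status: str = "open"          # open -> fixed -> verified (or rejected)

    def mark_fixed(self):
        self.status = "fixed"

    def mark_verified(self):
        self.status = "verified"

# Example: log a finding, then record its fix and re-check.
report = ErrorReport("budget.xlsx", "Summary!C10", "SUM range omits row 9", "R. Inspector")
report.mark_fixed()
report.mark_verified()
print(report)
```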

References

Allwood, C. M. (1984) Error detection processes in statistical problem solving, Cognitive Science. 8(4), 413-437.

Aurigemma, S. A. & Panko, R. R. (2010, July 9-11) The detection of human spreadsheet errors by humans versus inspection (auditing) software, EuSpRIG 2010 Proceedings, Greenwich, UK, Sponsored by the European Spreadsheet Risks Interest Group (EuSpRIG).

Ayalew, Y., Clermont, M., Mittermeir, R., & Klagenfurt, A. (2000). Detecting errors in spreadsheets. Spreadsheet Risks, Audit and Development Methods, 1, 51-62.

Bishop, B., & McDaid, K. (2007). An empirical study of end-user behaviour in spreadsheet error detection & correction, Proceedings of the European Spreadsheet Risk Interest Group Conference.

Butler, R. J. (2000). Is this spreadsheet a tax evader? How HM Customs and Excise test spreadsheet applications, Hawaii International Conference on System Sciences. Published by the IEEE Computer Society.

Clermont, M., & Mittermeir, R. (2003). Auditing large spreadsheet programs, Proceedings of International Conference on Information Systems Implementation and Modelling (pp. 87-97).

Clermont, M., Hanin, C., & Mittermeir, R. (2002). A spreadsheet auditing tool evaluated in an industrial context, Spreadsheet Risks, Audit and Development Methods, 3, 35-46.

Cragg, P. G., & King, M. (1993). Spreadsheet modelling abuse: An opportunity for OR? Journal of the Operational Research Society, 44(8), 743-752.

Croll, G.J. (2003). A typical model audit approach: Spreadsheet audit methodologies in the City of London. IFIP, Integrity and Internal Control in Information Systems, 124, pp. 213-219, Kluwer, 2003, http://aps.arxiv.org/abs/0712.2591.

Davies, N., & C. Ikin (1987, December). Auditing spreadsheets, Australian Accountant. 54-56.

Dent, A. (1995, April 2). Personal communication with the author via electronic mail.

Fagan, M. E. (1976). Design and code inspections to reduce errors in program development, IBM Systems Journal, 15(3), 182-211.

Fagan, M. E. (1986, July). Advances in software inspections, IEEE Transactions on Software Engineering, SE-12(7), 744-751.

Galletta, D. F., Abraham, D., El Louadi, M., Lekse, W., Pollalis, Y. A., & Sampler, J. L. (1993). An empirical study of spreadsheet error-finding performance, Accounting, Management and Information Technologies, 3(2), 79-95.

Galletta, D. F., Hartzel, K. S., Johnson, S. E., Joseph, J. L., & Rustagi, S. (1996). Spreadsheet presentation and error detection: An experimental study, Journal of Management Information Systems, 13(3), 63.

Gilb, T., & Graham, D. (1993). Software Inspection. Edinburgh Gate, England: Pearson Education Limited.

Hicks, L. (1995). NYNEX, personal communication with the first author via electronic mail.

Howe, H., & Simkin, M. G. (2006). Factors affecting the ability to detect spreadsheet errors, Decision Sciences Journal of Innovative Education, 4(1), 101-122.

Joseph, Jimmie L. (2002). The effect of group size on spreadsheet error debugging, unpublished doctoral dissertation, Katz Graduate School of Business, University of Pittsburgh, Pittsburgh, Pennsylvania.

Klein, B. D., Goodhue, D. L. & Davis, G. B. (1997). Can humans detect errors in data? Impact of base rates, incentives, and goals, MIS Quarterly, 21(2), 169-194.

KPMG Management Consulting (1998, July 30). Supporting the decision maker—A guide to the value of business modeling, press release.

MacKay, D. G. (1992). Errors, ambiguity, and awareness in language perception and production. In B. J. Baars (Ed.), Experimental Slips and Human Error: Exploring the Architecture of Volition (pp. 39-69). New York: Plenum.

Nixon, D., & O'Hara, M. (2001). Spreadsheet auditing software, Proceedings of the Second Symposium of the European Spreadsheet Risks Interest Group, Greenwich, UK.

Panko, R. R. (1999). Applying code inspection to spreadsheet testing. Journal of Management Information Systems, 16(2), 176.

Panko, R. R. (1988). End User Computing: Management, Applications, and Technology. New York: Wiley.

Panko, R. R., & Halverson, R. P., Jr. (1996, January). Spreadsheets on trial: A framework for research on spreadsheet risks, Proceedings of the Twenty-Ninth Hawaii International Conference on System Sciences, Volume II, Kihei, Maui, 326-335.

Panko, R. R., & Aurigemma, S. (2010). Revising the Panko-Halverson taxonomy of spreadsheet errors. Decision Support Systems, 49(2010), 235–244

Panko, R.R., & Halverson, R.P., Jr. (1997, Spring). Are two heads better than one? (At reducing errors in spreadsheet modeling), Office Systems Research Journal, 15(1), 21-32.

Panko, R. R., & Sprague, R. H., Jr. (1998, April). Hitting the wall: Errors in developing and code inspecting a “simple” spreadsheet model, Decision Support Systems, 22(4), 337-353.

Plous, S. (1993). The psychology of judgment and decision making, Philadelphia, Pennsylvania: Temple University Press.

Ricketts, J. A. (1990, March). Powers-of-ten information biases, MIS Quarterly, 14(1), 63-77.

Teo, T. S. H., & Lee-Partridge, J. E. (2001). Effects of error factors and prior incremental practice on spreadsheet error detection: an experimental study. Omega, 29(5), 445-456.

Teo, T. S. H., & Tan, M. (1997, January). Quantitative and qualitative errors in spreadsheet development, Proceedings of the Thirtieth Hawaii International Conference on System Sciences, Kihei, Hawaii.