I give that Web site an 11
"Pushing the Envelope" from ACM <interactions>, May/June 2006
I have always been skeptical of statistics. Maybe it's just fear of the unknown; my brain is demonstrably not wired for math. But everyone knows that statistics can be made to lie, tweaked and manipulated and interpreted to mean nearly anything, to support or justify nearly any position or goal. I'm even more skeptical of statistical methods applied to measurements of what are arguably qualitative factors. So why should statistics about usability or user experience design or HCI be any different?
There has been plenty of discussion over recent years about how many users it takes to produce valid conclusions about usability. Despite my ignorance about statistical methods, I have no trouble understanding that you will get more valid information from tests with dozens or hundreds of participants than you will by having three to five users. I can see the value of small test samples during iterative development, when the goal is to locate any glaring usability issues. But you can't convince me that such a test will reveal all the issues that a larger, more comprehensive test will find. [5, 7]
Last year I had the opportunity to participate in an evaluation of some interface designs. The team agreed to quantify a set of heuristics and apply them to the designs, to gain some insight into which design might be best suited for use in a product. The evaluation was a dismal failure. As it turned out, each of the four participants applied the heuristics unevenly. We recorded our data in an unstable tool that provided inconclusive results, even after statistical analysis. The final report glossed over the statistical findings and focused instead on one team member's qualitative evaluation of the interfaces' appearance and features, with no reference to the quantitative values collected and analyzed. So much for quantifying heuristics.
Watching the recent winter Olympics, I was reminded again of my pet peeve about sports: the problem of measuring beauty, of quantifying subjective evaluations. While I enjoy watching attractive people dancing on ice to beautiful music, I have trouble thinking of it as a sport equivalent to downhill ski racing or speed skating. The fastest skier or skater wins; that's pretty simple. But judging beauty and calling it sport? That's problematic. At least the scoring system for figure skating was changed after the scandal at the 2002 Olympics. But the new system still includes judging components for interpretation, choreography, and performance.
Does HCI have a similar problem? Are we measuring the unmeasurable? Can we seriously apply Six Sigma analytics to qualitative factors? Can Six Sigma give us the same confidence in qualitative judgments that we have in quantitative ones?
Don Norman has convinced me that emotion is an important factor in successful product design. Yet I suspect most measurement occurs on behavioral design, only one of Norman's three aspects of emotional design. How do we measure joy? How about measuring fun? Does funology have metrics? The name suggests that it's the study of fun; how is fun measured? Is there a pleasure chart for emotional design, much like the pain-quantifying charts used in today's hospitals? How do you quantify the pleasure of first opening your iPod's packaging, or holding the gadget in your hands for the first time?
If we asked users to quantify their pleasure on a ten-point scale, how many would insist on turning it up to 11? *
And what of the old management theory adage that you manage what you measure, or the equally important corollary, that employees (or designers?) pay attention only to what's being measured (and thereby rewarded or punished)? What gets measured is what gets done. If you can't measure it, can you design for it? Will users find what you measure in usability tests to be important to them just because you measured it? Or will designers consider a factor important only if you measure it, or only if usability gurus make pronouncements about it because they managed to measure it? Are there very real but really unmeasurable qualities and factors that affect user experience, that don't or can't have metrics associated with them?
We know that when schools and teachers are evaluated on the basis of improvements in standardized test scores, they start teaching what the tests measure. Many educators feel, however, that such tests have the wrong effect: they're producing successful test-takers, not successful students. If you can't measure how students learn to think for themselves, perhaps they won't learn to think.
Be careful what you measure: it may be what you get.
On the other hand, I am gratified to see that Stanford researcher B.J. Fogg and colleagues, when evaluating the credibility of Web sites, count user responses but do not try to quantify aspects such as "design look" or "information focus." Credible Web sites look good, just as we know that tall, attractive people are somehow seen as more credible than the rest of us. See, for instance, Malcolm Gladwell on the predominance of tall CEOs, and the presidential good looks (and dismal performance) of President Warren Harding. But please don't insult me further by trying to quantify "looking good"; we all know that there are no tens. **
The September/October [subsequently delayed to November/December 2006] issue of ACM <interactions> will focus on how we measure HCI and usability. Contact guest editor Jeff Sauro at email@example.com if you have something to say about measuring usability. Here's your chance to convince me.
1. Blythe, M., Hassenzahl, M., Wright, P., eds. (2004). More Funology. ACM interactions. Volume 11, Issue 5, September/October 2004.
2. Fogg, B.J., Soohoo, C., Danielson, D., Marable, L., Stanford, J., Tauber, E. (2003). How Do Users Evaluate the Credibility of Web Sites? Proceedings of the 2003 conference on Designing for user experiences.
3. Gladwell, M. (2005). Blink: The Power of Thinking Without Thinking. New York: Little, Brown, and Company.
4. McCauley, L. (1999). Measure What Matters. Fast Company, May 1999. http://fastcompany.com/magazine/24/one.html
5. Nielsen, J. (2004). Risks of quantitative studies. http://www.useit.com/alertbox/20040301.html
6. Norman, D. (2004) Emotional Design. New York: Basic Books.
7. Sauro, J. (2004). The risks of discounted qualitative studies. http://www.measuringusability.com/qualitative_risks.htm
*In the 1984 satirical film This Is Spinal Tap, one of the musicians famously insisted that the volume knob on his amplifier go up to 11 so that it would be louder than those that stopped at 10.
**Blake Edwards' 1979 film 10 starred Bo Derek, whose beauty presumably qualified her as a 10 on a scale of 1 to 10. Writer Shel Silverstein subsequently opined in song that "on a scale of ten to one, friend...there ain't no tens."
About the Author
Fred Sampson is a co-chair of BayDUX, Vice President for Finance of SIGCHI (as of July 2006), and a senior member of STC. In his spare time, Fred works as an information developer at IBM's Silicon Valley Lab in San Jose, California. Contact him at firstname.lastname@example.org.
© ACM 2006. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in ACM <interactions>, Volume XIII.3, ISSN 1072-5520, (May/June 2006), http://doi.acm.org/10.1145/1109069.1109077.