Probing and its Effects on the Validity and Reliability of Verbal Reports
Sally Abolrous
December, 2001
Introduction
History of Think-aloud Protocols
Pure Think-aloud Protocols
Level 1 verbalizations
Level 2 verbalizations
Level 3 verbalizations
Limitations of the Pure Think-aloud Method
Think-aloud Methods in Practice
“Keep Talking”
Acknowledgement Tokens
“What are you thinking right now?”
“John, could you tell us why you pressed the Enter Key?”
“Is there anything special that you’re looking for?”
“You seemed surprised/puzzled/frustrated, were you?”
“Was this task easy?”
“What features did you like and did not like?”
Conclusions and Recommendations
Eliciting verbal reports from participants in usability studies is a common
method for collecting performance and preference data. By asking users to
“think aloud,” usability practitioners can observe users interacting with an
interface while listening to their concurrent thoughts. Verbal data
is helpful because it shows observers how users think: what they look
for, how they expect to accomplish tasks, and what elements of the interface
they find confusing or helpful.
A
problem, however, is that think-aloud procedures vary widely among usability
practitioners. They vary in terms of the types of instructions given to
participants, the amount of interaction between the observers and the
participants, and the types of data being collected. Boren suggests that these
variations make it difficult to “compare or replicate studies, vouch for the
validity of the results, or teach a standard of practice to newcomers to the
field” (Boren [1], 266). He therefore suggests that
the methods employed by practitioners should be “theoretically motivated and
systematically applied” (Boren [1], 266).
One
widely accepted theory of verbal protocols is that of Ericsson and Simon.
Although verbal protocols have been used since 1889, Ericsson and Simon
validated their use in 1980 and defined the standards and the precise
conditions under which verbal protocols can be collected. However, usability
practitioners have strayed from these conditions and standards. One of
the major areas in which practitioners stray is probing participants in order
to collect data that could not otherwise be collected. Ericsson and Simon suggest
that the kinds of probing done by practitioners can greatly influence the reliability
of the verbal protocols, increase reactivity, bias the participant, or
simply yield invalid data.
In
this paper, I give a brief introduction to Ericsson and Simon’s theory and then examine the different strategies employed by
usability engineers to collect data. I also look into the
reliability and validity of the verbal data collected under each condition.
HISTORY OF THINK-ALOUD PROTOCOLS
Verbal
protocols were originally developed in 1889 as part of the cognitive interviewing
technique, in an attempt to track mental processes. These introspections, however,
were attacked by behaviorists, who claimed that the data were not reliable
because they could not be replicated. In the 1960s, as the field of cognitive psychology
expanded, verbal protocols began to make a comeback. In 1980, Ericsson and
Simon published “Verbal Reports as Data” in the Psychological
Review, work they later expanded into the book Protocol Analysis: Verbal Reports as Data. In this work, they proposed a theory of think-aloud protocols and provided
substantial empirical support for it. Ever since, the think-aloud method has
been used to gather data. However, as researchers stray from Ericsson and
Simon’s theory, a threat to the validity and reliability of the verbal data is
introduced.
PURE THINK-ALOUD PROTOCOLS
Ericsson
and Simon’s theory is based on the constructs of short-term memory (STM) and long-term
memory (LTM) as they are described by information-processing theory. Their
method is referred to as a pure think-aloud method because of its defined
standards and the strict conditions under which data is collected.
Ericsson
and Simon developed a model of verbalizations which includes three levels of
decreasingly reliable verbal reports.
Level 1 verbalizations
are the most reliable because they are not transformed before being verbalized
during a task. Therefore, they are valid representations of the information
being attended to in STM. Rhenius and Deffner conducted an eye-tracking study to test whether
these verbalizations truly reflect concurrent thought. They compared
temporal sequences of verbal utterances to sequences of gazes directed at
different parts of the task display. They found an impressive overlap between
think-aloud data and eye-movement data, which supports Ericsson and Simon’s theory
about the reliability of this type of verbalization.
Level 2 verbalizations
are transformed before verbalization; in other words, they involve description
or explication of the thought content. However, this explication is the only
mediating cognitive process between STM and verbalization. Ericsson and Simon
explain that “only the information attended to (i.e. held in STM) should be
verbalized, and that the recoding of this information for purposes of
vocalization should not otherwise alter the processing involved in the task
performance” (84).
Level 3 verbalizations
require additional cognitive processing beyond what is required for task
performance or verbalization. Subjects actually explain their thoughts
or thought processes, which may require them to interpret information they have already attended to or
to retrieve information from LTM. Ericsson and Simon regard this type of
verbalization as the least reliable and suggest that researchers try to
avoid it.
Ericsson
and Simon suggest that in order to achieve level 1 and level 2 verbalizations,
researchers should collect and analyze “hard” data only: what the
participant attends to and in what order, not participant introspection,
inference or opinion. They should also give detailed initial instructions for
thinking aloud, telling users to speak as if they were alone in the room.
Furthermore, researchers should remind users to think aloud using short,
nondirective reminders such as “keep talking” and should otherwise avoid any
interaction with participants, including comments and questions.
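As a rough illustration of the three-level model, the sketch below shows one way an analyst might tag transcript segments with a verbalization level and keep only the level 1 and level 2 segments that Ericsson and Simon treat as reliable. The `Level` and `Segment` classes, the `reliable_segments` function, and the sample utterances are hypothetical and are not taken from Ericsson and Simon; assigning a level to a real utterance remains a human coding judgment.

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    """Verbalization levels; lower values are treated as more reliable."""
    DIRECT = 1      # level 1: thought verbalized without transformation
    RECODED = 2     # level 2: thought described or explicated
    EXPLAINED = 3   # level 3: extra processing such as explaining or recalling


@dataclass(frozen=True)
class Segment:
    """One coded utterance from a think-aloud transcript (hypothetical schema)."""
    text: str
    level: Level


def reliable_segments(segments):
    """Keep only level 1 and level 2 segments, preserving their order."""
    return [s for s in segments if s.level <= Level.RECODED]


# Invented sample data; a human coder would assign the levels.
transcript = [
    Segment("File, Edit, View", Level.DIRECT),
    Segment("I'm looking at the row of icons under the menu", Level.RECODED),
    Segment("I clicked Save because my old editor worked that way", Level.EXPLAINED),
]

kept = reliable_segments(transcript)
print([s.text for s in kept])
```

Filtering in this way only separates the data; it does not recover what a participant would have said had no probe been asked.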
LIMITATIONS OF THE PURE THINK-ALOUD METHOD
Although
Ericsson and Simon affirm that level 3 verbalizations are unreliable sources of
data, usability practitioners rely on this type. In order to collect certain
types of information about users’ preferences, confusing and effective
interface elements, and users’ perceptions, usability practitioners need to
probe by asking the users additional questions. Ericsson and Simon would argue
that “instructions requiring a subject to explain his thoughts may direct his
attention to his procedures, thus changing the structure of the thought
process” (78). Redirecting the user’s attention and changing the thought
process can be thought of as a type of experimenter bias, which can greatly
influence the reliability and validity of the data.
However,
because of the nature of the usability testing environment, testers are often
forced to intervene. As Boren points out, Ericsson and
Simon’s approach is "challenged by complex interfaces, self-conscious
participants, tight deadlines, and incomplete or buggy software or
prototypes" ([1], 266). If a participant encounters a bug in the
system, or if the system malfunctions, a tester has to intervene in order to
advance with the test. Also, if a user gets “stuck” and cannot move ahead, a
tester will intervene in order to prevent the user from getting frustrated, or
may even purposefully redirect the user to another part of the interface in
order to gather needed information. Level 3 verbalizations provide usability
practitioners with data necessary to enhance a product’s ease-of-use—“without
it, product evaluation would be less thorough and valid” (Wright and Converse,
1224).
Although
Ericsson and Simon’s theory is continuously referred to by prominent usability
practitioners, the theory may not be completely applicable to usability
testing. In experimental terms, human cognition is not what is being studied in a diagnostic usability test; it
is the test apparatus. What is being studied is an interface or
system, not the cognition of the participant (Boren [1], 272). “Transcriptions, word-for-word analysis, and
validation of cognitive models are simply not what most usability professionals
are after” (Boren [2], 53). In that case, it is not as
imperative to abide by Ericsson and Simon’s theory of relying on level 1 and
level 2 data only. However, usability engineers still need to agree on how to
conduct think-aloud protocols that aim to gather level 3 verbalizations, in a
way that reduces the threat to the data’s reliability and validity.
THINK-ALOUD METHODS IN PRACTICE
Although
it is imperative to collect level 3 verbalizations during usability tests,
testers have to be aware that the types of prompts or comments they make can
greatly influence the validity and reliability of the data. However, Tamler shows that prominent practitioners have varying
opinions on the topic of tester intervention, ranging from completely “passive,
unobtrusive observation” to “full partnership with users” (11).
Intervention
can be broken down into two types: reminders to keep talking and probes used to
elicit additional information from users. According to Ericsson and Simon,
reminders such as “keep talking” can be used without any threat to the
validity of the verbal reports, but other types of probes should not be used.
Prominent usability practitioners nevertheless use what Ericsson and Simon would
call “more intrusive” reminders and probes in order to gather useful
information. Boren notes that of roughly 125 interventions observed during his
study, only 16 were reminders to think aloud (about 13%). Other interventions were meant to direct the
participants to a particular area where feedback was needed, to clarify a participant’s
comment, to answer a participant’s question about the task, to help a
participant who is stuck, or to deal with software malfunction or an incomplete
prototype. Below, I will discuss the different types of interventions that are
being used during diagnostic usability tests and will analyze them for their
effect on the verbal data.
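To make this kind of breakdown concrete, the following sketch tallies a coded list of tester interventions and reports what share of them were plain reminders to think aloud. The category labels, the `reminder_share` function, and the sample log are assumptions made for illustration; they are not Boren's coding scheme or data.

```python
from collections import Counter

# Hypothetical category label for nondirective "keep talking" reminders.
REMINDER = "reminder"


def reminder_share(interventions):
    """Return (counts per category, fraction of interventions that are reminders)."""
    counts = Counter(interventions)
    total = sum(counts.values())
    share = counts[REMINDER] / total if total else 0.0
    return counts, share


# Invented log of coded interventions from one test session.
log = ["reminder", "clarify", "redirect", "help", "reminder",
       "malfunction", "clarify", "answer"]

counts, share = reminder_share(log)
print(counts.most_common())
print(f"reminders: {share:.0%} of {len(log)} interventions")
```

A tally like this describes only how often each kind of intervention occurred; judging its effect on the verbal data still requires examining what participants said afterward.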
“Keep Talking”
Ericsson
and Simon claim that the “keep talking” reminder is not intrusive. In fact, they claim that
“Reminders to verbalize of the ‘keep talking’ variety should have a very small,
if any, effect on the subject’s processing” (Ericsson and Simon, 83). However,
Boren disagrees. He conducted a study in which he observed nine usability
engineers conducting tests at four different companies. During his
observations, Boren found that whenever engineers reminded participants to
think aloud (using Ericsson and Simon’s suggestion), participants apologized
for forgetting, retrospected about what they had
been doing prior to the reminder, speculated about what they would do next, or
stopped to reflect on what they were thinking before verbalizing it (Boren [2],
69). Boren claims that these reactions “indicate breaks in the normal task
flow, re-direction of attention, and level 3 verbalizations” ([2], 70). Most
often, the “keep talking” reminder resulted in apologies from participants,
therefore interrupting the task flow and flustering users. Furthermore, Boren
noted that “In only 4 of the 16 cases where usability engineers reminded
participants in some way to remember to think aloud did the participant clearly
resume a concurrent report. And even those four do not completely exclude
elements of retrospection, speculation, or apology in addition to the
concurrent report” ([2], 75).
Acknowledgement Tokens
Ericsson
and Simon warn against intervention during a usability study. They suggest that
testers should not intervene except to remind users to think aloud. However,
from a speech communication perspective, silence can be distracting to the
participant. Speech communication theory holds that “any time words
are spoken knowingly for another’s benefit, the roles of speaker and listener
exist. Both parties are aware of and are reactive to each other” (Boren [1], 267). Furthermore, speakers cannot ignore
listeners, even silent ones. “Speakers expect that listeners will react to what
they say, and that listeners’ actions (or inactions) are reflective of that
response” (Boren [1], 267). Therefore, silence from the
listener interspersed with commands to keep talking is an abrasive form of
contact ([1], 267). Boren’s field studies show that silence from the usability
practitioner seemed distracting to users, who apparently felt the need to
“check the connection.” The speech communication perspective suggests that
usability practitioners can use acknowledgment tokens to provide
participants with the response expected of engaged listeners, while still lying
low and promoting the participant’s speakership. Whereas
tokens are natural, nonintrusive responses, silence
is an unnatural and potentially intrusive response that can even indicate the
presence of an uninterested or disengaged listener ([1], 271). Tokens
reassure participants that they are being heard and listened to.
Acknowledgment
tokens include oh, ah, mm hm, uh huh, ok, yeah, and
so on. Each of these suggests something different to the user and
should therefore be used carefully. Boren suggests that mm hm
or uh huh spoken with an interrogative intonation are the most appropriate
tokens for usability tests ([1], 270). These tokens are continuers: they are not
intrusive or directive, but they let participants know that they are being
listened to and encourage them to continue verbalizing their thoughts. Other
tokens, such as “ok” or “yeah,” have strong connotations of agreement, but
they may still be useful to testers who want to agree with users without being
intrusive.
Boren
argues that since tokens carry no content, they should not introduce any new
content into the speaker’s STM, nor should they redirect participants’
attention. Furthermore, Boren suggests that token passing requires very little
to no processing, and acknowledges that usually, speakers move on with their
speech before tokens have been completed, resulting in an overlap with the
tokens.
“What are you thinking right now?”
This
question is suggested by Rubin and is widely used to remind participants to
speak when they fall silent, as an alternative to Ericsson and Simon’s more
abrasive “keep talking.” Rubin also suggests using this question to gather
more information from users or to draw out their thought processes and feelings. For
example, if users sigh, grin or frown at the software, testers can use this
question to find out what their reaction means instead of guessing.
Ericsson and Simon, however, would argue that such a question may invite
opinion, evaluation or justification instead of continued verbalization of
thoughts as they are heeded, thereby resulting in level 3 verbalizations.
“John, could you tell us why you pressed the Enter key?”
Dumas
and Redish suggest using “neutral questions” like the
one above to answer more specific questions about the interface (281). Such
questions help researchers understand why users perform
certain actions or what elements of the interface are confusing or helpful.
These useful questions may not be answered unless usability practitioners
ask them directly. On analysis, however, this question is intrusive: it redirects
attention, elicits retrospection and disrupts the task flow, resulting in level
3 verbalizations.
Nielsen
suggests a similar question, “What do you think this message
means?” as a way to collect more data from participants. This
question redirects the user’s attention to the message, forces the user to
analyze and explain it, and disrupts the task flow. It
therefore also results in level 3 verbalizations.
However,
the redirection of the users’ attention in both of these situations is required
in order for testers to answer specific questions about the interface. Testers
must rely on users making inferences or retrieving information from their
long-term memory in order to answer these questions. The purpose of the
usability tests is to find out if an interface is usable and intuitive to the
users. In many cases, this depends on the users’ knowledge and past experiences
with similar interfaces. Therefore, the knowledge located in their LTM may be
useful to the usability practitioner.
“Is there anything special that you’re looking for?”
The
question above was reported by Boren as a common one during his field studies.
This question can be useful for reminding silent users to speak. However, such
a question can introduce bias into the data obtained. Because this
closed question implies that users are looking for something, they may
feel that they should be looking for something, and therefore start doing so.
“You seemed surprised/puzzled/frustrated, were you?”
Boren
also discovered that some usability practitioners ask questions similar to the
one above in order to figure out what users are thinking or to remind
them to share their thoughts. Although such questions can be useful as reminders, they
can influence how participants respond and behave. Dumas and Redish warn against using these types of questions. They
argue that participants express themselves differently, and that we, as
practitioners, take a risk by trying to guess what they are thinking (299).
Although participants may appear to be confused or surprised, they may just be
contemplating or thinking about what they are going to do next. However, the
use of specific words such as surprised,
frustrated and confused may lead them to think that they are expected to
feel that way and may actually alter their behavior.
“Was this task easy?”
Dumas
and Redish warn against using the question above and
suggest that usability practitioners examine how they use adjectives and
adverbs in their questions. As an alternative, they recommend “Was
this task easy or hard to perform?” That way, testers can avoid biasing
participants to answer in a particular way, revealing how they themselves feel about the
product, or putting words into participants’ mouths.
“What features did you like and did not like?”
Boren
also reported that this question was quite prevalent during usability tests.
However, practitioners usually ask participants for this type of information
during post-test interviews, after completion of tasks. In other words, they do
not interfere with the task performance, but instead ask the participants for
retrospective verbalizations. While the validity of retrospective
verbalizations is controversial, usability practitioners rely on them to gather
users’ preferences. Brinkman suggests that retrospective verbal reports are
less valid and are inferior to concurrent reports (1394). Other researchers
suggest that this type of data is flawed, because it can be affected by
reactivity. McGrath explains that if participants are aware that they
are acting for the researcher’s purposes, and not their own, their responses
might be influenced by this knowledge. They may try to “make a good impression,
give socially desirable answers to help the researcher get the results being
sought…” (166). McGrath acknowledges that these self-reports are
potentially flawed, but he also considers them a very useful form of
evidence (166).
Ericsson
and Simon recommend that if usability engineers seek additional information
from participants, they should collect it in the form of retrospective
reports after the task to avoid interrupting the task flow; that way, they
can ensure that the information retrieved from concurrent thinking aloud consists of
reliable level 1 and 2 verbalizations. Although retrospective reports are useful
when collected by usability practitioners after task completion, they do not
eliminate the need to probe during task performance for clarifications of
actions, redirection, or users’ opinions. Such data must be collected during
task performance and not in the form of retrospective reports.
Bowers
and Snyder conducted an experiment to compare concurrent thinking
aloud with heavily cued retrospection, in which users were shown a
videotape of their performance and asked to recall their thoughts after task
completion. Bowers and Snyder found that the type of data collected under each
type of report differed significantly. Concurrent reports contained more procedural
and reading statements, while retrospective reports contained more explanations and
design statements. The authors conclude that “if a researcher is interested in
richer information such as explanation and design issues, retrospective verbal
protocols is the method of choice for verbal protocol collection” (1274). Also,
“the conditions under which the data are collected, such as time constraints,
as well as the questions that need to be addressed, should determine the
protocol to be employed” (1274).
The
methods of probing listed above are a sample of the questions that prominent
usability practitioners use in the field or recommend. The questions and comments vary in their length, their
level of intrusiveness, and their effectiveness. Some, depending on the
situation, can greatly affect participants’ behavior and therefore threaten
the reliability of the data. Some are useful only in certain conditions or for
collecting a specific type of data. And some should be avoided because of the
bias they can introduce. Most importantly, the situation and the type of data
pursued determine what questions practitioners should ask. Rubin
advises that “even a sigh at the wrong time can influence the results and
render all or a portion of the results useless” (219).
Practitioners therefore have to be aware of the comments they make and the questions they ask,
and of their potential effect on the data collected.
CONCLUSIONS AND RECOMMENDATIONS
While
Ericsson and Simon’s theory provides usability practitioners with a foundation
for the think-aloud method, the theory must be modified in order to meet the
needs of usability engineers. To collect useful data during diagnostic
usability tests, practitioners must use level 3 verbalizations, which Ericsson
and Simon consider unreliable. Although this data may be unreliable if used to
report on cognitive processes, it can be reliable when used to determine the
ease-of-use of an interface and users’ preferences. However, certain types of
probes can greatly affect the validity and reliability of the verbal data.
Therefore, usability practitioners have to be aware of the possible threats to
the data caused by their intervention. Furthermore, they have to make decisions
about the specific types of data they are seeking and how to best collect it.
They need to consider which probes to use and how and when to use them. The first
step is to determine what kind of information usability practitioners want,
whether raw cognitive processes or user opinions and preferences. The next step is to determine
how to retrieve that information and ensure its validity and reliability.
Further
research needs to be conducted to study the potential biasing and reactivity
caused by the different probes. Researchers should carefully study the
reactions of the participants and their responses for their validity and
reliability. Furthermore, an investigation of the reliability of retrospective
data also needs to be conducted.
REFERENCES
[1] M. T. Boren and J. Ramey, “Thinking aloud: Reconciling theory and practice,” IEEE Transactions on Professional Communication, vol. 43, pp. 261-278, 2000.
[2] M. T. Boren, “Conducting Verbal Protocols in Usability Testing: Theory and Practice,” Ph.D. dissertation,
[3] K. A. Ericsson and H. A. Simon, Protocol Analysis: Verbal Reports as Data.
[4] H. Tamler, “How (much) to intervene in a usability testing session,” Common Ground, vol. 8, no. 3, pp. 11-15, 1998.
[5] J. Rubin, Handbook of Usability Testing.
[6] J. S. Dumas and J. C. Redish, A Practical Guide to Usability Testing.
[7] R. B. Wright and S. A. Converse, “Method Bias and Concurrent Verbal Protocol in Software Usability Testing,” in Proc. Human Factors Soc. 36th Annu. Meet.,
[8] N. Ummelen and R. Neutelings, “Measuring reading behavior in policy documents: A comparison of two instruments,” IEEE Transactions on Professional Communication, vol. 43, pp. 292-301, 2000.
[9] V. Bowers and H. Snyder, “Concurrent vs. Retrospective Verbal Protocols for Comparing Window Usability,” Proc. Human Factors Soc. 34th Annu. Meet.,
[10] L. Van Waes, “Thinking aloud as a method for testing the usability of websites: The influence of task variation on the evaluation of hypertext,” IEEE Transactions on Professional Communication, vol. 43, pp. 279-291, 2000.
[11] D. Rhenius and G. Deffner, “Evaluation of Concurrent Thinking Aloud using Eye-tracking Data,” Proc. Human Factors Soc. 34th Annu. Meet.,
[12] J. Held and D. Biers, “Software Usability Testing: Do Evaluator Intervention and Task Structure Make any Difference?” in Proc. Human Factors Soc. 36th Annu. Meet.,
[13] J. Brinkman, “Verbal Protocol Accuracy in Fault Diagnosis,” Ergonomics, vol. 36, no. 11, pp. 1381-1397, 1993.