by
Gery Ryan
Fieldwork & Qualitative Data Laboratory
Department of Psychiatry & Biobehavioral Sciences
UCLA School of Medicine
University of California Los Angeles
740 Westwood Plaza
Room C8-881, NPI
Los Angeles, CA 90095
Phone: 310-825-0890
Fax: 310-825-9875
Email: GRYAN@UCLA.EDU
Email: TWEISNER@UCLA.EDU
March 1, 1997
Please do not cite without authors' permission.
Typically, investigators use measures of intercoder agreement
as reliability and validity checks. High degrees of intercoder
agreement mean that multiple coders are applying the codes in
the same manner and thus acting as "reliable"
measurement instruments. Coders who independently mark the same
text for a theme provide evidence that a theme has external
"validity" and is not just a figment of the
investigator's imagination. In this article, I describe two
additional reasons for using multiple coders. Measures of
intercoder agreement and disagreement can be used to identify
core and periphery features of a theme and are useful for finding
prototypical or exemplary text.
Data Collection
During the summer of 1996, Thomas Weisner and I taught a
qualitative data analysis course to clinicians at the UCLA
medical school. As part of the course, we wanted participants to
actually collect, code, and analyze qualitative data in a
systematic manner. One of the substantive topics we explored was
clinicians' own past experiences with colds and flu. Such acute
and frequently occurring illnesses were something that everyone
had in common, and from past experience we knew it would provide
rich and varied data. We also thought it would be interesting to
see how doctors would describe their own illness experiences.
On the first day, we asked the participants to complete an
open-ended questionnaire. One section included the following
instructions: "In a couple of short paragraphs, please
describe the last time you had a cold or the flu." We
were intentionally vague about what to include in the
descriptions and did not prompt participants for particular types
of answers.
We collected responses from 23 clinicians in the class. In
all, the text totaled 1554 words (about 5 single-spaced pages).
For the next class, we made copies of all 23 descriptions and
gave them to the participants to read. While reading, we asked
them to keep a running list of themes and ideas that they
noticed. When they finished, we led a group discussion about the
themes they identified. From the discussion, we decided to focus
further on three of the emergent themes: 1) respondents'
perceptions of signs and symptoms, 2) their descriptions of how
the illness interrupted their daily activities, and 3) the
criteria they used to select treatments.
As a homework assignment, we asked participants to "Read
each illness description and mark blocks of continuous text where
informants mention any of the three themes." The
instructions about what counted as "blocks of continuous
text" was left intentionally vague so each coder had to
decide whether to mark whole sentences or just key phrases. We
did, however, explicitly tell them that they could mark text
units with multiple themes.
Since we needed to do the coding with paper and pencil, we
used a variety of simple marking conventions: signs and symptoms
were underlined with a straight line, interruptions of daily
routine were marked with a wiggly line, and decision criteria
marked with "<<" at the beginning of the block
and with ">>" at the end. We also had
participants record the exact time they started coding and the
exact time they finished. Below, I look at ten participants who
completed the coding task. On average, it took participants only
twenty minutes to read five pages and code for three themes.
Data Management
Assessing intercoder agreement requires that each coder's marking
behavior be placed in a standard format. The trick is to consider
text as a long list of words that can be converted into a simple
matrix. Each word represents a single row in the matrix and is
described by its structural relationship to a particular
sentence, paragraph, and informant number and by its relation
with each coded theme.
Describing text in a matrix format does not in any way reduce
the potential for interpretive analysis nor make text analysis a
mechanical procedure. Interpretation occurs when investigators
identify new themes (thus creating new columns in a matrix) and
when investigators associate text passages with these themes
(thus assigning values to the appropriate cells). The fact that I
can "back translate" the matrix and reproduce exactly
what each coder did when he or she marked up the text indicates
that data are not lost in the translation procedure. Describing
text as a matrix, however, allows for comparisons across words,
sentences, themes, questions, domains, informants, ethnic groups,
and coders (Miles & Huberman 1994, Roberts 1997).
In order to move from the paper coding to a matrix of words, I
went through a number of steps. First, I made ten identical word
processing files of the text--one for each coder. Then I used a
set of macros in Microsoft Word to transfer each coder's marking
into their own electronic file (Ryan 1996). (Of course, it would
have been easier if each coder had marked the text on the
computer in the first place, but this was not feasible at the
time.) The macros insert coding tags, similar to those described
by Truex (1993), at the beginning and end of specific blocks of
text. Figure 1 shows some examples of the marking conventions.
Figure 1. Coding of signs and symptoms by coders 1 and 2 for one illness description.

Coder 1: >>S/S|| I was uncomfortable with fevers associated with myalgrams, headaches, and a sore throat. ||S/S<< >>Int|| I was unable to do my usual activities such as exercising, keeping up with routine paperwork. I could not enjoy social events and went to works but was miserable. ||Int<< It lasted about ten days and I recovered.

Coder 2: I was uncomfortable with fevers associated with >>S/S|| myalgrams, headaches, ||S/S<< and a >>S/S|| sore throat. ||S/S<< I was unable to >>Int|| do my usual activities such as exercising, keeping up with routine paperwork. I could not enjoy social events and ||Int<< >>Dec|| went to works ||Dec<< but was >>S/S|| miserable. ||S/S<< It lasted about ten days and I recovered.
Once the tags were embedded in the file, I created a program
in Visual Basic to convert each text file into a matrix (in this
case a Microsoft Access database). Each matrix (or database) had
1554 rows/records (one for each word in the text) and 9
columns/fields. The first six columns were filled with variables
that characterized each word and were constant for all coders.
These included the word itself, the word identification number (1-1554), the
sentence number (1-107), the respondent number (1-23), the number
of times each word form appeared in the text (between 1 and 129
times), and whether each word belonged to a list of common words
that included particles, prepositions, and pronouns (0-1). [Word
frequency counts and common-word lists are standard techniques in
content analysis (Krippendorff 1980, Weber 1990) and have been
reviewed in CAM by Ryan & Weisner (1996).] The remaining
three columns were filled with 1's and 0's to indicate whether
the coder had marked the particular word as pertaining to one of
the respective themes or not.
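For readers working outside Word and Access, the same conversion can be sketched in Python. This is only an illustration of the idea, not the original macro and Visual Basic implementation: the tag convention follows Figure 1, but the function name, the column names, and the abbreviated common-word list are hypothetical.

import re

# Tag convention from Figure 1: >>S/S|| ... ||S/S<< encloses a block coded as
# signs/symptoms; Int and Dec mark the other two themes.
THEMES = ["S/S", "Int", "Dec"]
COMMON_WORDS = {"i", "a", "the", "and", "with", "was", "to", "of"}  # illustrative stub only

def text_to_rows(tagged_text, informant_id):
    """Convert one coder's tagged text into word-level rows (one dict per word)."""
    open_themes = set()   # themes currently switched on by an opening tag
    rows = []
    word_id = 0
    sentence_id = 1
    # Split the text into plain-text stretches and tag tokens, keeping the tags.
    tokens = re.split(r'(>>\S+?\|\||\|\|\S+?<<)', tagged_text)
    for token in tokens:
        opening = re.match(r'>>(\S+?)\|\|', token)
        closing = re.match(r'\|\|(\S+?)<<', token)
        if opening:
            open_themes.add(opening.group(1))
        elif closing:
            open_themes.discard(closing.group(1))
        else:
            for word in token.split():
                word_id += 1
                clean = word.strip('.,;:').lower()
                rows.append({
                    "word_id": word_id,
                    "sentence": sentence_id,
                    "informant": informant_id,
                    "word": clean,
                    "common": int(clean in COMMON_WORDS),
                    **{theme: int(theme in open_themes) for theme in THEMES},
                })
                if word.endswith('.'):
                    sentence_id += 1   # crude sentence counter
    return rows

# Example: Coder 1's marking of the first description in Figure 1.
coder1 = (">>S/S|| I was uncomfortable with fevers associated with myalgrams, "
          "headaches, and a sore throat. ||S/S<< It lasted about ten days and I recovered.")
rows = text_to_rows(coder1, informant_id=1)

Each row carries the structural variables plus one 0/1 flag per theme, which is all the later agreement calculations need.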
Since I wanted to analyze intercoder agreements for each theme
separately, I merged the ten coders' matrices into three
theme-oriented matrices, one for signs/symptoms, one for
interruption of daily routine, and one for decision criteria. As
before, each matrix had 1554 rows and the first six columns
contained structural data describing each word. The next ten
columns contained 1's and 0's to indicate whether each of the ten
coders marked the word or not.
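Merging the ten coders' matrices is then a matter of laying the coders' 0/1 columns for one theme next to a single copy of the structural columns. A minimal sketch, continuing from the hypothetical rows produced by text_to_rows() above:

def merge_theme(coder_matrices, theme):
    """Build one theme-oriented matrix from per-coder word matrices.

    coder_matrices: one list of row dicts per coder, all in the same word order.
    Returns rows holding the shared structural fields plus a 0/1 column per coder.
    """
    structural = ["word_id", "sentence", "informant", "word", "common"]
    merged = []
    for rows in zip(*coder_matrices):              # walk all matrices word by word
        base = {key: rows[0][key] for key in structural}
        for c, row in enumerate(rows, start=1):
            base[f"coder_{c}"] = row[theme]        # 1 if coder c marked this word
        merged.append(base)
    return merged

# e.g. ss_matrix = merge_theme([rows_coder1, rows_coder2, ...], theme="S/S")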
Intercoder Pairs
One way to describe the central and peripheral aspects of a
theme is to examine the agreement between pairs of
coders. Since the text was coded by ten coders, I selected a
pair of coders (1 & 2) at random and divided their data into
four parts for each theme. Figure 2 shows the partial results for
the signs/symptom (S/S) theme. All the text that Coder 1 marked
as S/S, but Coder 2 did not, appears in the column headed "1
Only." All text that both Coder 1 and Coder 2
marked as S/S appears in the column headed "1 & 2."
All text that Coder 2, but not Coder 1, marked as S/S appears
in the column headed "2 Only." The last column is
filled with the remaining text (the text marked by neither
Coder 1 nor Coder 2 as S/S). The numbers in Figure 2 indicate how the
original text was broken apart. For example, the first phrase in
column 1, "I was uncomfortable with fevers associated
with1", is connected to the first phrase in
column 2, "1myalgrams, headaches,2."
Figure 2. Intercoder agreement and disagreement about signs and symptoms for three illness descriptions with all words presented.

Episode 1
1 Only: I was uncomfortable with fevers associated with 1 2 and a 3
1 & 2: 1 myalgrams, headaches, 2 3 sore throat 4
2 Only: 5 miserable. 6
Neither 1 nor 2: 4 I was unable to do my usual activities such as exercising, keeping up with routine paperwork. I could not enjoy social events and went to works but was 5 6 lasted about ten days and I recovered.

Episode 2
1 Only: 3 I 4 6 I was 7 8 After the 9 10 I had an 11 12 for two weeks which was embarrassing at times. 13 14 Although I 15 16 I could not control. 17
1 & 2: 4 felt tired, 5 7 very congested. 8 9 congestion cleared, 10 11 annoying dry cough 12 15 felt fine, 16 17 the coughing during the play. 18
2 Only: 1 a cold 2 19 continued to cough.
Neither 1 nor 2: I had 1 2 of moderate severity. 3 5 but I had to work. 6 13 I had tickets to a play that I had looked forward to for two months. 14 18 I bought two kinds of cough drops and cough syrup but still 19

Episode 3
1 Only: 1 usual, more 2 3 needed to nap several times during day. Had a 4 5 and progressive difficulty working all day. 6
1 & 2: Energy level felt lower than 1 2 easily tired, 3 4 productive cough 5
2 Only: (none)
Neither 1 nor 2: 6 Felt like probable bronchitis and arranged for antibiotics. Recovery noted after several weeks.
Intercoder agreement (the text in the column labeled "1
& 2" ) shows us the core features for a theme. In Figure
1, core signs and symptoms include myalgrams, headaches, sore
throat, tired, contested/congestion, cough/coughing, and lower
energy levels.
In contrast, intercoder disagreement (the text in the columns
labeled "1 Only" and "2 Only") shows the
theme's peripheral features. In Figure 2, peripheral features
include subthemes associated with general discomfort and hassle (uncomfortable,
miserable, could not control, needed to nap,
and difficulty working all day) and time (two weeks,
after the, continued to, at times, several times
during the day, and all day). Similar words and
phrases are relatively uncommon in the core features, suggesting
that these disagreements reflect systematic differences rather than the
result of one or another coder forgetting to code a particular phrase.
Since there is more text in the column "1 Only" than
there is in "2 Only," I can see that Coder 1 has a
tendency to mark more text than does Coder 2. It turns out that
Coder 1 always marked the entire sentence while Coder 2 marked
exact phrases. These two approaches have different advantages.
Marking phrases provides a narrower and more concise summary of
the theme, while marking sentences broadens the theme to less
obvious aspects of signs and symptoms such as associations with
discomfort and time.
Coders also agreed about what text should not be marked
as pertaining to signs and symptoms. The text in the column
marked "Neither 1 nor 2" is indicative of the extreme
boundaries of the theme. For instance, neither of these two
coders felt that the phrase "of moderate severity"
pertained to signs and symptoms, yet one of the coders felt that
"a cold" did. The "left over" text is also a
good place to look for additional themes. For example, much of
the last column pertains to the interruption of daily routine and
treatments. It is also a good place to look for phrases that both
coders may have failed to mark.
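Given a theme matrix of this kind, the four-way split that underlies Figure 2 takes only a few lines. The sketch below uses the hypothetical coder_1/coder_2 column names from the earlier examples:

def partition_pair(theme_rows, a="coder_1", b="coder_2"):
    """Sort each word into the four cells used in Figure 2 for one pair of coders."""
    cells = {"a_only": [], "both": [], "b_only": [], "neither": []}
    for row in theme_rows:
        if row[a] and row[b]:
            cells["both"].append(row["word"])
        elif row[a]:
            cells["a_only"].append(row["word"])
        elif row[b]:
            cells["b_only"].append(row["word"])
        else:
            cells["neither"].append(row["word"])
    return cells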
Multiple Coders
With only two coders, the prototypicality of responses
pertaining to any theme is difficult to assess. With multiple
coders, however, the task is a little easier. First, I calculated
the intercoder word frequency -- i.e., the number of
times that the ten coders marked each word. The numbers ranged
from zero (no coders marked the word) to ten (all coders marked
the word). I assume that the more coders who identify words and
phrases as pertaining to a given theme, the more prototypical the
text.
I built a set of programs to read a theme's matrix and
identify words that: a) at least one coder had marked as
pertaining to the theme (intercoder word frequency >0); and b)
did not belong to the common word list. (This can also be done in
the database program Access using the structured
query language.) Once I identified the key text for the
theme, I formatted the output based on the intercoder word
frequencies. I printed the text marked by all ten coders in 20
point font, the text marked by nine coders in 18 point font, the
text marked by eight coders in 16 point font, and so on. The
bigger the font, the more prototypical the text. Think of large
fonts as being closer to the conceptual bull's eye. (Instead of
font size, I could have used variations in color or background
shading.)
I also created a mirror image of the prototypicality output. I
printed the text marked by one coder in 20 point font, the text
marked by two coders in 18 point font, the text marked by three
coders in 16 point font, and so on. In this case, the larger the
font, the more peripheral the text is to the theme.
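Both displays can be driven by the same intercoder word frequencies. The sketch below counts, for each word, how many coders marked it, and then writes simple HTML using the 20/18/16-point scheme described above (reversed for the mirror image). The HTML output and the floor on very small sizes are my own choices, not part of the original programs.

def intercoder_frequency(theme_rows, n_coders=10):
    """Number of coders (0..n_coders) who marked each word for the theme."""
    return [sum(row[f"coder_{c}"] for c in range(1, n_coders + 1)) for row in theme_rows]

def sized_html(theme_rows, freqs, n_coders=10, mirror=False, skip_common=True):
    """Render every word marked by at least one coder, sized by agreement."""
    pieces = []
    for row, f in zip(theme_rows, freqs):
        if f == 0 or (skip_common and row["common"]):
            continue
        weight = (n_coders + 1 - f) if mirror else f      # reverse the scale for the mirror image
        size = max(8, 20 - 2 * (n_coders - weight))       # 10 coders -> 20pt, 9 -> 18pt, ...; floored at 8pt
        pieces.append(f'<span style="font-size:{size}pt">{row["word"]}</span>')
    return " ".join(pieces)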
Figure 3. Intercoder agreement as represented by font size for two illness descriptions.

ID 1
Prototypicality: uncomfortable fevers associated myalgrams, headaches, sore throat. unable usual activities not enjoy social events miserable.
Mirror of Prototypicality: uncomfortable fevers associated myalgrams, headaches, sore throat. unable usual activities not enjoy social events miserable.

ID 2
Prototypicality: cold moderate severity. felt tired, work. congested. congestion cleared, annoying dry cough weeks embarrassing times. Although felt fine, could not control coughing during play. still continued cough.
Mirror of Prototypicality: cold moderate severity. felt tired, work. congested. congestion cleared, annoying dry cough weeks embarrassing times. Although felt fine, could not control coughing during play. still continued cough.
Figure 3 shows the two display techniques side-by-side. In the
figure's second example, the core concepts related to signs and
symptoms are tired, congested, and cough
followed by concepts related to felt, congestion, cleared,
annoying, and to some extent weeks. The mirror
image shows that moderate severity, embarrassing, times, and
felt fine occupy a more peripheral position within the
theme. By juxtaposing core and peripheral concepts, I have a more
sophisticated way of describing abstract themes.
The techniques described above are useful for displaying the
prototypicality of words in context, but for large corpora of
text such techniques are not very practical for describing
general patterns. In such cases, an even more concentrated format
is needed. I combined basic techniques from content analysis to
list the words that coders agree belong to a given theme. Figure
4 shows the most prototypical words associated with signs and
symptoms. The first column lists those words that all ten coders
agreed belonged to the S/S theme. These words are ranked
according to how often each word appeared in the text. For
example, the word throat occurred 14 times in the text,
and in at least one occurrence, all ten coders agreed the word
pertained to the S/S theme. Unlike classic content analysis that
associates high frequency words with theme salience, this
technique identifies words that are pertinent to a theme but may
have low frequencies. For example, coders always associated shaking
and sweats with signs and symptoms, even though both
words only occurred once in the illness descriptions.
Figure 4. Word frequency ranked by intercoder agreement (common words eliminated). Entries are listed as "frequency word" within each agreement level.

Marked by all 10 coders:
24 feel(ing, felt); 14 throat; 12 sore; 11 cough(ing); 11 fever(s, ish); 6 congest(ed,ion); 6 tired; 4 achy(es,ing); 4 nasal; 3 chills; 3 nose; 3 over; 2 fatigue(d); 2 headache(s); 2 level; 2 low(er); 2 malaise; 2 mydrias(grams); 2 productive; 2 rhinorrhea; 2 sneezing; 2 vomiting; 2 whole; 1 discharge; 1 drip; 1 dry; 1 easily; 1 energy; 1 generalized; 1 grade; 1 headedness; 1 lethargic; 1 malcongestion; 1 mydrias; 1 nonproductive; 1 post; 1 runny; 1 shaking; 1 sweats; 1 than

Marked by 9 coders:
14 last(ed, ing); 8 not; 7 developed; 4 mild; 4 started; 4 well; 3 bedridden(bed); 3 degrees; 3 did; 3 usual; 2 clear(ed); 2 horrible; 2 pretty; 2 severe; 1 alternately; 1 annoying; 1 appeared; 1 body; 1 light; 1 lymphadenopatry; 1 minimal; 1 muscles; 1 progressive; 1 spent; 1 subsequently; 1 uncomfortable; 1 waning; 1 weak

Marked by 8 coders:
33 day(s); 6 slept; 3 lot; 2 second; 1 associated; 1 clogged; 1 extremely; 1 immediately; 1 nap; 1 needed

Marked by 7 coders:
17 work; 10 home; 9 so; 9 weeks; 5 symptoms; 3 before; 3 began; 3 several; 2 end; 2 friday; 2 night; 2 really; 2 staying; 2 still; 2 third; 2 times; 1 concentration; 1 difficulty; 1 evening; 1 experienced; 1 followed; 1 high

Marked by 6 coders:
8 because; 7 ago; 6 flu; 6 next; 4 just; 3 didnt; 3 during; 3 few; 3 morning; 2 got; 1 typical
The words on which all ten coders agreed tend to be
related to physiological indicators. These words occupy the core
part of the sign/symptom construct. The second set has a number
of words related to severity (e.g., mild, degrees,
horrible, several, annoying, minimal),
suggesting that evaluations of illnesses might be an important
subtheme. Sets further to the right indicate peripheral
subthemes. For example, words related to time (e.g., days,
ago, weeks, before, end, night,
morning, followed, next), and
behaviors (e.g., slept, nap, work, concentration)
are found in the last three columns.
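A list like the one in Figure 4 can be approximated by taking, for each word form outside the common-word list, the highest number of coders who marked it anywhere in the text, grouping word forms by that level, and ranking each group by the word's overall frequency. A sketch under those assumptions, reusing the hypothetical structures above:

from collections import Counter, defaultdict

def ranked_word_list(theme_rows, freqs):
    """Group word forms by their best intercoder agreement, ranked by text frequency."""
    text_freq = Counter(row["word"] for row in theme_rows)
    best = {}                                       # word form -> highest agreement reached
    for row, f in zip(theme_rows, freqs):
        if row["common"]:
            continue                                # drop particles, prepositions, pronouns
        best[row["word"]] = max(best.get(row["word"], 0), f)
    columns = defaultdict(list)
    for word, level in best.items():
        if level > 0:
            columns[level].append((text_freq[word], word))
    return {level: sorted(entries, reverse=True)    # highest-frequency words first
            for level, entries in sorted(columns.items(), reverse=True)}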
Measuring the Prototypicality of Quotes
The word analysis techniques described above are useful for
depicting the range and central tendency of a theme. Researchers,
however, typically want to use prototypical examples and quotes
in their descriptions. Two problems arise: a) How does the
investigator identify the most prototypical quotes?; and b) How
can a critic or a reviewer be sure that the selected quotes or
examples are indeed representative of the text being analyzed?
Figure 5. Sentences related to signs and symptoms sorted by weighted prototypicality scores.

ID | Raw | Total score | Total rank | Weighted score | Weighted rank | Sentence
11 | 10 | 40 | 36 | 10.0 | 1 | Sore throat, rhinorrhea, mydrias.
10 | 10 | 44 | 34 | 9.8 | 2 | Minimal sore throat, nasal discharge.
17 | 10 | 72 | 25 | 9.0 | 3 | My nose was congested and I was sneezing.
22 | 10 | 116 | 10 | 8.9 | 4 | I had a headache, post nasal drip, low grade fever and felt horrible.
13 | 10 | 79 | 22 | 8.8 | 5 | I felt fatigued and experienced generalized malaise and myalgias.
8 | 10 | 52 | 33 | 8.7 | 6 | Subsequently congestion and mild cough appeared.
4 | 10 | 137 | 6 | 8.6 | 7 | Energy level felt lower than usual, more easily tired, needed to nap several times during day.
17 | 10 | 42 | 35 | 8.4 | 8 | I felt lethargic and tired.
12 | 10 | 134 | 7 | 8.4 | 9 | I started coughing and then developed some malcongestion: a sore throat as well as a fever.
2 | 10 | 107 | 13 | 8.2 | 10 | I was uncomfortable with fevers associated with myalgrams, headaches, and a sore throat.
19 | 10 | 171 | 4 | 8.1 | 11 | On Friday night I developed a fever of 103 degrees, and spent that night in bed alternately with chills and sweats.
6 | 10 | 341 | 1 | 8.1 | 12 | It began for me with a mild sore throat and a mild fever but by the second day my fever was up to 101 degrees F. by the third day up to 103 degrees F. and my sore throat was pretty severe.
3 | 10 | 32 | 40 | 8.0 | 13 | I was very congested.
21 | 10 | 104 | 15 | 8.0 | 14 | I was congested, achy and feverish with the symptoms lasting about 4-5 days.
9 | 10 | 198 | 2 | 7.9 | 15 | Five weeks ago I developed sore throat that lasted two days followed by about ten days of nasal congestion, clear rhinorrhea and mild nonproductive cough.
Drawing from our previous analysis of words, I calculated three
measures of sentence prototypicality. Figure 5 shows the
sentences that are the most representative of signs and symptoms
for our ten coders. I calculated a raw agreement score
by counting the number of coders that had marked at least one
word in a sentence as pertaining to the S/S theme. The raw
agreement score consists of integers from zero to ten and is
simple to explain. From a practical perspective, however, this
score was not very helpful for our purposes as it produced a lot
of ties and tended to give higher scores to longer sentences.
Next I calculated a total agreement score by counting
the number of coders that had marked each word and then summing
the counts across all the words in a sentence. The minimum score
was zero and the maximum score was the number of coders (in this
case ten) multiplied by the number of words in the longest
sentence (in this case 42). The range of score values was much
wider than for the raw agreement scores, thus allowing for finer
distinctions between sentences. Although longer sentences still
had a better chance of scoring higher than did shorter sentences,
high scores identified sentences containing the most signs and
symptoms.
Finally I calculated a weighted agreement score by
taking the total agreement score for a sentence and dividing it
by the number of words in the sentence. Scores ranged from zero
to ten (the total number of coders) and can be considered a
measure of a sentence's potency with regard to a theme. Those
sentences that score high tend to be extremely pithy with regard to
the S/S theme. For example, the first sentence in Figure 5,
"Sore throat, rhinorrhea, mydrias" has a weighted
agreement score of ten because all ten coders marked all four
words as pertaining to the S/S theme.
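All three scores come directly from the per-word agreement counts. A minimal sketch, again with the hypothetical column names used earlier:

from collections import defaultdict

def sentence_scores(theme_rows, freqs, n_coders=10):
    """Raw, total, and weighted agreement scores for every sentence."""
    by_sentence = defaultdict(list)
    for row, f in zip(theme_rows, freqs):
        by_sentence[row["sentence"]].append((row, f))
    scores = {}
    for sid, items in by_sentence.items():
        counts = [f for _, f in items]
        # raw: how many coders marked at least one word in the sentence
        raw = sum(1 for c in range(1, n_coders + 1)
                  if any(row[f"coder_{c}"] for row, _ in items))
        total = sum(counts)                   # sum of per-word agreement counts
        weighted = total / len(counts)        # total divided by the sentence's word count
        scores[sid] = {"raw": raw, "total": total, "weighted": weighted}
    return scores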
By ranking the sentences according to total agreement scores,
I can identify those sentences that coders not only agreed
pertained to the theme, but also contained the most prototypical
words. By ranking the sentences according to the weighted
agreement scores, I identify prototypical but pithy sentences
that pertain to the theme. In general, the two types of scores
are quite similar. In fact, I find that the scores on all 107
sentences are correlated at r = 0.77 (p< 0.001), and the
correlation of the ranks is 0.85 (p<0.001).
Discussion
Intercoder agreement has a variety of functions in the
analysis of text.
Intercoder-agreement-as-reliability:
Agreement between coders tells investigators the degree to which
coders can be treated as "reliable" measurement
instruments (e.g., Carey et al. 1996). High degrees of intercoder
reliability mean that multiple coders are applying the codes in
the same manner. If one coder marks half of the data and another
coder marks the other half, investigators need to know that both
coders are performing more or less the same tasks. Normally the
reliability test is done by having both coders independently code
a sample of the entire text.
Intercoder reliability is particularly important if the coded
data will be analyzed statistically. If coders disagree, then the
coded data are inaccurate. Discrepancies between coders are
considered error and affect analysis calculations. One of the
advantages of using content analysis and word dictionaries (Stone
et al. 1966) is that such techniques are 100% reliable because
they always mark exactly the same text. Of course, content
analysis cannot judge the more subtle meanings of statements as
human coders can.
Intercoder-agreement-as-reliability also is important for text
retrieval tasks. After coding, researchers usually want to search
through their texts and find examples of a particular code. An
investigator who uses a single coder to mark themes relies on
the coder's ability not to miss examples. Having multiple coders
mark a text increases the likelihood of finding all the
examples in a text that pertain to a given theme.
Intercoder-agreement-as-validity: Mitchell (1979) noted that most qualitative analyses use intercoder agreement to measure construct validity rather than measurement reliability.
Demonstrating that multiple coders can pick the same text as
pertaining to a theme shows that the theme is not just a figment
of the primary investigator's imagination. Validity could be
further increased if informants (rather than investigators) acted
as coders.(1)
Intercoder-agreement-as-construct-definition:
Theme identification and definition is part of the inductive
research process. It begins when investigators try to define the
themes that they find emerging from their texts. After reading
over the corpus, team members discuss what constructs or themes
they want to examine, and what kind of things "count
as" a particular construct. This is the process of building
a codebook or theme list.
As Lyn Richards, one of the authors of the text management
software NUD*IST, said in a note to a qualitative list server:
"the central qualitative process is usually seeing and
discussing diverging interpretations of codes. I print summary
reports, pin them on the wall and talk!" When developing or
refining codebook definitions, the differences among researchers'
interpretations are not measured systematically, nor should they
be. The point is for the investigative team to come to some
agreement as to what "counts as" the construct. In most
cases, the constructs are "fuzzy" and require
prototypical examples rather than strict logical definitions.
In this article, I have suggested two additional ways to use
intercoder agreement. First, intercoder agreement measures can be
used as a tool to systematically describe the range, central
tendency, and distribution of responses within a theme. They are
particularly useful for identifying gradations of core and
periphery structures within abstract constructs. Second,
intercoder agreement is a measurement device for identifying and
ranking prototypical quotes from informants. Critics and
reviewers often want to know to what degree the quotes and
examples used by investigators are indeed representative of
informants' texts. Intercoder agreement measurements provide an
answer.
Addendum: How many coders is enough?
By far the most common response to reviews of this article has
been, "But I can't afford to use ten coders. How many is
enough?" It is a fair, but tough question. Below I give some
general thoughts on the matter. The answer seems to depend on: a)
the ability of the coder to identify themes, b) the
core/periphery dispersion of the theme, c) the number of times
that any given theme appears in the text, and d) the levels of
specificity investigators wish to achieve.
The last two constraints are similar to the sampling problems
Bernard and Killworth (1993) solved for time allocation research.
They showed that the rarer an event's occurrence in a population,
the more you have to sample to ensure that you find it with any
confidence. They also showed that sample size depends on whether
you want to be sure of identifying at least one occurrence of an
event, or whether you want to know the frequency of an event's
occurrence in the population within a particular confidence
interval.
In the case of text analysis (unlike time allocation), the
population is known -- it's the entire corpus of text that has
been collected. The unknowns are the rate of a theme's occurrence
and each coder's ability to identify the theme when it occurs. It
stands to reason that if a theme occurs a lot, a single coder is
likely to find at least one example of the theme, even if the
coder is not very good at identifying themes. If the theme occurs
rarely, however, the likelihood of finding a single example
decreases. It decreases even more if the coder isn't very good.
Unfortunately, investigators usually are willing to miss a few
examples of a theme that occurs a lot but can't afford to miss
any examples of a theme that occurs rarely. It makes sense,
therefore, that the rarer a theme's occurrence and the more
important it is to find all occurrences, the more coders you want
to look for it.
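The logic can be made concrete with a toy calculation. Suppose, purely for illustration, that each coder independently spots any given occurrence of a theme with probability p; a panel of k coders then misses the occurrence with probability (1 - p)^k. The numbers below are hypothetical, not estimates from our data.

def prob_caught(p, k):
    """Probability that at least one of k independent coders marks an occurrence."""
    return 1 - (1 - p) ** k

# A coder who spots a theme 70% of the time misses a rare occurrence
# almost a third of the time working alone, but rarely with help:
for k in (1, 2, 3, 5, 10):
    print(k, round(prob_caught(0.7, k), 3))   # 0.7, 0.91, 0.973, 0.998, 1.0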
The number of coders needed to identify aspects of
core/periphery structures in constructs depends on the level of
distinction an investigator wants to make. I see themes and
abstract constructs as targets made up of concentric circles. The
more coders you add, the more circles you have in the target.
With a single coder, you cannot distinguish between core and
peripheral features of a theme. Figure 2 shows what can be learned
about core/periphery features with just two coders, and Figure 4
shows the kinds of distinctions that can be made with ten coders.
In hindsight, I can use the multiple coders to calculate the
probability of any one coder associating a single word with a
particular theme. Figure 4 shows that any single coder would
probably have associated any of the words in the first column
with the S/S theme. The single coder would have had a 90%
probability of identifying those words in the second column, an
80% probability for those in the third column, and so forth. It
becomes apparent that investigators interested in confidently
identifying and describing the peripheral aspects of a theme will
probably want to use multiple coders.
Likewise, since all the quotes shown in Figure 5 were marked by
all the coders, I can assume that any single coder would have
found them. Of course, a single coder would also have identified
quotes that were less prototypical. With a single coder,
however, there would be no way to separate less prototypical
quotes from the more prototypical quotes. Increasing the number
of coders will not help find more core quotes, but it will allow
investigators to distinguish among quotes in a replicable manner.
It also seems reasonable to assume that the less well defined a
construct, the more coders are needed to describe it in detail.
I find it helpful to recognize the limitations and advantages
of single and multiple-coder research. It seems plausible that
for some tasks investigators can rely on a single coder and for
other tasks they should use multiple coders. Ultimately, it is
the investigator's responsibility to identify the goals of the
research and determine what kind of coding is required.
Acknowledgments
I would like to thank Kathleen MacQueen for steering us toward
the problem of identifying and measuring prototypicality and
Russell Bernard for his insights into potential solutions to this
problem. I would also like to thank the Clinical Scholars who
collected and coded the texts.
References
Bernard, H. Russell and Peter D. Killworth
1993 Sampling in time allocation research. Ethnology
32:207-215.
Carey, James W., Mark Morgan and Margaret J. Oxtoby
1996 Intercoder agreement in analysis of responses to
open-ended interview questions: Examples from tuberculosis
research. Cultural Anthropology Methods Journal 8(3):1-5.
Krippendorff, Klaus
1980 Content analysis: An introduction to its methodology.
Beverly Hills: Sage Publications.
Miles, Matthew B. and A. Michael Huberman
1994 Qualitative data analysis: an expanded sourcebook. 2nd
ed. Thousand Oaks, CA: Sage Publications.
Mitchell, Sandra K.
1979 Interobserver agreement, reliability, and
generalizability of data collected in observational studies.
Psychological Bulletin 86:376-390.
Roberts, Carl W.
1997 A theoretical map for selecting among text analysis
methods. In: Text Analysis for the Social Sciences: Methods for
Drawing Statistical Inferences from Texts and Transcripts. Carl
W. Roberts, ed. NJ: Lawrence Erlbaum Associates. pp. 275-283.
Ryan, Gery
1996 Fieldnote Searcher, 1.0. Los Angeles: Fieldwork &
Qualitative Data Laboratory, UCLA.
Ryan, Gery and Thomas Weisner
1996 Analyzing words in brief descriptions: Fathers and
mothers describe their children. Cultural Anthropology Methods
Journal 8(3):13-16.
Stone, Philip J., Dexter C. Dunphy, Marshall S. Smith and Daniel M. Ogilvie
1966 The general inquirer: A computer approach to content
analysis. Cambridge: MIT Press.
Truex, Gregory F.
1993 Tagging and typing: Notes on codes in anthropology.
Cultural Anthropology Methods Journal 5(1):3-5.
Weber, Robert Philip
1990 Basic content analysis. 2nd ed. Newbury Park, CA: Sage
Publications.
Endnotes
1. I would like to thank Roy D'Andrade for
this suggestion.