Measuring the prototypicality of text:
Using intercoder agreement for more than just reliability and validity checks

by

Gery Ryan
Fieldwork & Qualitative Data Laboratory
Department of Psychiatry & Biobehavioral Sciences
UCLA School of Medicine
University of California Los Angeles
740 Westwood Plaza
Room C8-881, NPI
Los Angeles, CA 90095


Phone: 310-825-0890
Fax: 310-825-9875
Email: GRYAN@UCLA.EDU
Email: TWEISNER@UCLA.EDU

 

March 1, 1997

Please do not cite without authors' permission.

 

Typically, investigators use measures of intercoder agreement as reliability and validity checks. High degrees of intercoder agreement mean that multiple coders are applying the codes in the same manner and thus acting as "reliable" measurement instruments. Coders who independently mark the same text for a theme provide evidence that a theme has external "validity" and is not just a figment of the investigator's imagination. In this article, I describe two additional reasons for using multiple coders. Measures of intercoder agreement and disagreement can be used to identify core and peripheral features of a theme and are useful for finding prototypical or exemplary text.

Data Collection

During the summer of 1996, Thomas Weisner and I taught a qualitative data analysis course to clinicians at the UCLA medical school. As part of the course, we wanted participants to actually collect, code, and analyze qualitative data in a systematic manner. One of the substantive topics we explored was clinicians' own past experiences with colds and flu. Such acute and frequently occurring illnesses were something that everyone had in common, and from past experience we knew it would provide rich and varied data. We also thought it would be interesting to see how doctors would describe their own illness experiences.

On the first day, we asked the participants to complete an open-ended questionnaire. One section included the following instructions: "In a couple of short paragraphs, please describe the last time you had a cold or the flu." We were intentionally vague about what to include in the descriptions and did not prompt participants for particular types of answers.

We collected responses from 23 clinicians in the class. In all, the text totaled 1554 words (about 5 single-spaced pages). For the next class, we made copies of all 23 descriptions and gave them to the participants to read. While reading, we asked them to keep a running list of themes and ideas that they noticed. When they finished, we led a group discussion about the themes they identified. From the discussion, we decided to focus further on three of the emergent themes: 1) respondents' perceptions of signs and symptoms, 2) their descriptions of how the illness interrupted their daily activities, and 3) the criteria they used to select treatments.

As a homework assignment, we asked participants to "Read each illness description and mark blocks of continuous text where informants mention any of the three themes." The instructions about what counted as "blocks of continuous text" were left intentionally vague so each coder had to decide whether to mark whole sentences or just key phrases. We did, however, explicitly tell them that they could mark text units with multiple themes.

Since we needed to do the coding with paper and pencil, we used a variety of simple marking conventions: signs and symptoms were underlined with a straight line, interruptions of daily routine were marked with a wiggly line, and decision criteria marked with "<<" at the beginning of the block and with ">>" at the end. We also had participants record the exact time they started coding and the exact time they finished. Below, I look at ten participants who completed the coding task. On average, it took participants only twenty minutes to read five pages and code for three themes.

Data Management

Assessing intercoder agreement requires each coder's marking behavior be placed in a standard format. The trick is to consider text as a long list of words that can be converted into a simple matrix. Each word represents a single row in the matrix and is described by its structural relationship to a particular sentence, paragraph, and informant number and by its relation with each coded theme.

Describing text in a matrix format does not in any way reduce the potential for interpretive analysis nor make text analysis a mechanical procedure. Interpretation occurs when investigators identify new themes (thus creating new columns in a matrix) and when investigators associate text passages with these themes (thus assigning values to the appropriate cells). The fact that I can "back translate" the matrix and reproduce exactly what each coder did when he or she marked up the text indicates that data are not lost in the translation procedure. Describing text as a matrix, however, allows for comparisons across words, sentences, themes, questions, domains, informants, ethnic groups, and coders (Miles & Huberman 1994, Roberts 1997).

In order to move from the paper coding to a matrix of words, I went through a number of steps. First, I made ten identical word processing files of the text--one for each coder. Then I used a set of macros in Microsoft Word to transfer each coder's marking into their own electronic file (Ryan 1996). (Of course, it would have been easier if each coder had marked the text on the computer in the first place, but this was not feasible at the time.) The macros insert coding tags, similar to those described by Truex (1993), at the beginning and end of specific blocks of text. Figure 1 shows some examples of the marking conventions.

Figure 1 Coding of signs and symptoms by coders 1 and 2 for one illness description.

Coder 1, coded text for Informant 1:

  >>S/S|| I was uncomfortable with fevers associated with myalgrams, headaches, and a sore throat. ||S/S<< >>Int|| I was unable to do my usual activities such as exercising, keeping up with routine paperwork. I could not enjoy social events and went to works but was miserable. ||Int<< It lasted about ten days and I recovered.

Coder 2, coded text for Informant 1:

  I was uncomfortable with fevers associated with >>S/S|| myalgrams, headaches, ||S/S<< and a >>S/S|| sore throat. ||S/S<< I was unable to >>Int|| do my usual activities such as exercising, keeping up with routine paperwork. I could not enjoy social events and ||Int<< >>Dec|| went to works ||Dec<< but was >>S/S|| miserable. ||S/S<< It lasted about ten days and I recovered.

Once the tags were embedded in the file, I created a program in Visual Basic to convert each text file into a matrix (in this case a Microsoft Access database). Each matrix (or database) had 1554 rows/records (one for each word in the text) and 9 columns/fields. The first five columns were filled with variables that characterized each word and were constant for all coders. These included the word identification number (1-1554), the sentence number (1-107), the respondent number (1-23), the number of times each word form appeared in the text (between 1 and 129 times), and whether each word belonged to a list of common words that included particles, prepositions, and pronouns (0-1). [Word frequency counts and common-word lists are standard techniques in content analysis (Krippendorff 1980, Weber 1990) and have been reviewed in CAM by Ryan & Weisner (1996).] The remaining three columns were filled with 1's and 0's to indicate whether the coder had marked the particular word as pertaining to one of the respective themes or not.
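The conversion from tagged text to a word-by-theme matrix can be illustrated with a short Python sketch. This is not the Visual Basic program described above; the tag grammar is inferred from Figure 1, and the function and field names (tagged_text_to_rows, word_id, and so on) are hypothetical. Only the word identifier and the three theme flags are produced here; the sentence, respondent, frequency, and common-word columns are omitted.

```python
import re

# Theme labels as they appear in Figure 1.
THEMES = ("S/S", "Int", "Dec")
# Opening tags look like >>S/S|| and closing tags like ||S/S<< (inferred from Figure 1).
TAG = re.compile(r">>(S/S|Int|Dec)\|\||\|\|(S/S|Int|Dec)<<")


def _add_words(chunk, open_themes, rows):
    """Append one row per word in chunk, flagged with the themes currently open."""
    for word in chunk.split():
        row = {"word_id": len(rows) + 1, "word": word}
        for theme in THEMES:
            row[theme] = 1 if theme in open_themes else 0
        rows.append(row)


def tagged_text_to_rows(tagged):
    """Return one dict per word with a 0/1 flag for each theme."""
    rows = []
    open_themes = set()
    pos = 0
    for match in TAG.finditer(tagged):
        _add_words(tagged[pos:match.start()], open_themes, rows)
        if match.group(1):            # opening tag
            open_themes.add(match.group(1))
        else:                         # closing tag
            open_themes.discard(match.group(2))
        pos = match.end()
    _add_words(tagged[pos:], open_themes, rows)
    return rows


# First sentence of Informant 1 as marked by Coder 2 in Figure 1.
sample = ("I was uncomfortable with fevers associated with >>S/S|| myalgrams, "
          "headaches, ||S/S<< and a >>S/S|| sore throat. ||S/S<<")
rows = tagged_text_to_rows(sample)
marked = [r["word"] for r in rows if r["S/S"] == 1]
print(marked)
```

Because the set of open themes is tracked separately from the words, a word that sits inside two overlapping tag pairs would be flagged for both themes, which matches the instruction that text units could carry multiple themes.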

Since I wanted to analyze intercoder agreements for each theme separately, I merged the ten coders' matrices into three theme-oriented matrices, one for signs/symptoms, one for interruption of daily routine, and one for decision criteria. As before, each matrix had 1554 rows and the first six columns contained structural data describing each word. The next ten columns contain 1's and 0's to indicate whether each of the ten coders marked the word or not.
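As a rough illustration of this merging step, the sketch below rearranges per-coder matrices into one matrix per theme, with one row per word and one column per coder. It assumes each coder's matrix is a list of dicts holding 0/1 theme flags in the same word order; the data layout and the name build_theme_matrix are assumptions made for the example, and the structural columns are left out.

```python
def build_theme_matrix(coder_matrices, theme):
    """Return rows of 0/1 marks, one row per word and one column per coder.

    coder_matrices maps a coder id to that coder's list of word rows
    (hypothetical layout: each row is a dict of theme flags).
    """
    coder_ids = sorted(coder_matrices)
    n_words = len(coder_matrices[coder_ids[0]])
    if any(len(coder_matrices[c]) != n_words for c in coder_ids):
        raise ValueError("every coder matrix must describe the same words")
    return [[coder_matrices[c][i][theme] for c in coder_ids]
            for i in range(n_words)]


# Toy data: two coders, three words (made up for illustration).
coder_matrices = {
    1: [{"S/S": 1, "Int": 0}, {"S/S": 1, "Int": 0}, {"S/S": 0, "Int": 1}],
    2: [{"S/S": 0, "Int": 0}, {"S/S": 1, "Int": 0}, {"S/S": 0, "Int": 1}],
}
ss_matrix = build_theme_matrix(coder_matrices, "S/S")
int_matrix = build_theme_matrix(coder_matrices, "Int")
print(ss_matrix)
print(int_matrix)
```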

Intercoder Pairs

One way to describe the central and peripheral aspects of a theme is to examine the agreement between pairs of coders. Since the text was coded by ten coders, I selected a pair of coders (1 & 2) at random and divided their data into four parts for each theme. Figure 2 shows the partial results for the signs/symptom (S/S) theme. All the text that Coder 1 marked as S/S, but Coder 2 did not, appears in the column headed "1 Only." All text that both Coder 1 and Coder 2 marked as S/S appears in the column headed "1 & 2." All text that Coder 2 marked as S/S but Coder 1 did not appears in the column headed "2 Only." The last column is filled with the remaining text (the text not marked as S/S by either Coder 1 or Coder 2). The numbers in Figure 2 indicate how the original text was broken apart. For example, the first phrase in the "1 Only" column, "I was uncomfortable with fevers associated with…1", is connected to the first phrase in the "1 & 2" column, "1…myalgrams, headaches,…2."

Figure 2 Intercoder agreement and disagreement about signs and symptoms for three illness descriptions with all words presented.

Episode ID 1

  1 Only: I was uncomfortable with fevers associated with…1 2…and a…3
  1 & 2: 1…myalgrams, headaches,…2 3…sore throat…4
  2 Only: 5…miserable.…6
  Neither 1 nor 2: 4…I was unable to do my usual activities such as exercising, keeping up with routine paperwork. I could not enjoy social events and went to works but was…5 6…lasted about ten days and I recovered.

Episode ID 2

  1 Only: 3…I…4 6…I was…7 8…After the…9 10…I had an…11 12…for two weeks which was embarrassing at times.…13 14…Although I…15 16…I could not control.…17
  1 & 2: 4…felt tired,…5 7…very congested.…8 9…congestion cleared,…10 11…annoying dry cough…12 15…felt fine,…16 17…the coughing during the play.…18
  2 Only: 1…a cold…2 19…continued to cough.
  Neither 1 nor 2: I had…1 2…of moderate severity.…3 5…but I had to work.…6 13…I had tickets to a play that I had looked forward to for two months.…14 18…I bought two kinds of cough drops and cough syrup but still…19

Episode ID 3

  1 Only: 1…usual, more…2 3…needed to nap several times during day. Had a…4 5…and progressive difficulty working all day.…6
  1 & 2: Energy level felt lower than…1 2…easily tired,…3 4…productive cough…5
  2 Only: (none)
  Neither 1 nor 2: 6…Felt like probable bronchitis and arranged for antibiotics. Recovery noted after several weeks.

 

Intercoder agreement (the text in the column labeled "1 & 2") shows us the core features for a theme. In Figure 2, core signs and symptoms include myalgrams, headaches, sore throat, tired, congested/congestion, cough/coughing, and lower energy levels.

In contrast, intercoder disagreement (the text in the columns labeled "1 Only" and "2 Only") shows the theme's peripheral features. In Figure 2, peripheral features include subthemes associated with general discomfort and hassle (uncomfortable, miserable, could not control, needed to nap, and difficulty working all day) and time (two weeks, after the, continued to, at times, several times during the day, and all day). Similar words and phrases are relatively uncommon in the core features, suggesting that they may be systematic differences rather than the results of one or another coder forgetting to code a particular phrase.

Since there is more text in the column "1 Only" than there is in "2 Only," I can see that Coder 1 has a tendency to mark more text than does Coder 2. It turns out that Coder 1 always marked the entire sentence while Coder 2 marked exact phrases. These two approaches have different advantages. Marking phrases provides a narrower and more concise summary of the theme, while marking sentences broadens the theme to less obvious aspects of signs and symptoms such as associations with discomfort and time.

Coders also agreed about what text should not be marked as pertaining to signs and symptoms. The text in the column marked "Neither 1 nor 2" is indicative of the extreme boundaries of the theme. For instance, neither of these two coders felt that the phrase "of moderate severity" pertained to signs and symptoms, yet one of the coders felt that "a cold" did. The "left over" text is also a good place to look for additional themes. For example, much of the last column pertains to the interruption of daily routine and treatments. It is also a good place to look for phrases that both coders may have failed to mark.
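The four-way split used for a pair of coders can be sketched in a few lines of Python. The function name partition_pair and the cell labels are hypothetical; the example marks are read off Figure 1 for the first sentence of Informant 1, where Coder 1 marked the whole sentence as S/S and Coder 2 marked only two phrases.

```python
def partition_pair(words, marks_a, marks_b):
    """Split words into four lists by how two coders marked them (1 = marked)."""
    cells = {"a_only": [], "both": [], "b_only": [], "neither": []}
    for word, a, b in zip(words, marks_a, marks_b):
        if a and b:
            key = "both"
        elif a:
            key = "a_only"
        elif b:
            key = "b_only"
        else:
            key = "neither"
        cells[key].append(word)
    return cells


words = ("I was uncomfortable with fevers associated with "
         "myalgrams, headaches, and a sore throat.").split()
coder_1 = [1] * len(words)                         # whole sentence marked
coder_2 = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1]  # two phrases marked
cells = partition_pair(words, coder_1, coder_2)
for label, text in cells.items():
    print(label, "->", " ".join(text))
```

The "both" cell reproduces the core phrases for this sentence, and the "a_only" cell holds the surrounding words that only the sentence-level coder captured.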

Multiple Coders

With only two coders, the prototypicality of responses pertaining to any theme is difficult to assess. With multiple coders, however, the task is a little easier. First, I calculated the intercoder word frequency -- i.e., the number of times that the ten coders marked each word. The numbers ranged from zero (no coders marked the word) to ten (all coders marked the word). I assume that the more coders who identify words and phrases as pertaining to a given theme, the more prototypical the text.

I built a set of programs to read a theme's matrix and identify words that: a) at least one coder had marked as pertaining to the theme (intercoder word frequency >0); and b) did not belong to the common word list. (This can also be done in the database program Access using the structured query language.) Once I identified the key text for the theme, I formatted the output based on the intercoder word frequencies. I printed the text marked by all ten coders in 20 point font, the text marked by nine coders in 18 point font, the text marked by eight coders in 16 point font, and so on. The bigger the font, the more prototypical the text. Think of large fonts as being closer to the conceptual bull's eye. (Instead of font size, I could have used variations in color or background shading.)
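A minimal Python sketch of the two selection criteria follows. It assumes a theme matrix stored as a list of rows, one per word, with one 0/1 mark per coder; the function names and the three-coder toy data are made up for illustration and are not the programs described above.

```python
def intercoder_word_frequency(theme_matrix):
    """Number of coders who marked each word (row sums of the 0/1 matrix)."""
    return [sum(row) for row in theme_matrix]


def key_text(words, theme_matrix, common_words):
    """Keep words marked by at least one coder that are not on the common-word list."""
    counts = intercoder_word_frequency(theme_matrix)
    return [(word, n) for word, n in zip(words, counts)
            if n > 0 and word.lower() not in common_words]


# Toy example with three hypothetical coders.
words = ["I", "felt", "tired", "and", "congested"]
theme_matrix = [[1, 0, 0], [1, 1, 0], [1, 1, 1], [0, 0, 0], [1, 1, 1]]
common_words = {"i", "and"}
selected = key_text(words, theme_matrix, common_words)
print(selected)
```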

I also created a mirror image of the prototypicality output. I printed the text marked by one coder in 20 point font, the text marked by two coders in 18 point font, the text marked by three coders in 16 point font, and so on. In this case, the larger the font, the more peripheral the text is to the theme.
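The font sizes stated in the text (20, 18, and 16 points) fit a simple linear rule, which the sketch below extends to the remaining agreement levels. That extension is an assumption, since the text only says "and so on", and the HTML output is just one convenient way to show the idea; the function names are hypothetical.

```python
from html import escape


def font_size(count, n_coders=10, mirror=False):
    """Point size for a word marked by `count` coders.

    Assumes the pattern stated in the text continues linearly:
    two points per agreement level, reversed for the mirror display.
    """
    if not 1 <= count <= n_coders:
        raise ValueError("count must be between 1 and n_coders")
    level = n_coders + 1 - count if mirror else count
    return 2 * level


def render_html(words_with_counts, mirror=False):
    """Wrap each (word, count) pair in a span sized by intercoder agreement."""
    return " ".join(
        f'<span style="font-size:{font_size(n, mirror=mirror)}pt">{escape(w)}</span>'
        for w, n in words_with_counts
    )


# Hypothetical counts for three words.
print(render_html([("cough", 10), ("annoying", 8), ("weeks", 3)]))
print(render_html([("cough", 10), ("annoying", 8), ("weeks", 3)], mirror=True))
```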

Figure 3 Intercoder agreement as represented by font size for two illness descriptions.

ID 1

  Prototypicality: uncomfortable fevers associated myalgrams, headaches, sore throat. unable usual activities not enjoy social events miserable.
  Mirror of Prototypicality: uncomfortable fevers associated myalgrams, headaches, sore throat. unable usual activities not enjoy social events … miserable. …

ID 2

  Prototypicality: cold moderate severity. felt tired, work. congested. congestion cleared, annoying dry cough weeks embarrassing times. Although felt fine, … could not control … coughing during … play. … still continued … cough.
  Mirror of Prototypicality: cold moderate severity. felt tired, work. congested. congestion cleared, annoying dry cough weeks embarrassing times. Although felt fine, … could not control … coughing during … play. … still continued … cough.

 

Figure 3 shows the two display techniques side-by-side. In the figure's second example, the core concepts related to signs and symptoms are tired, congested, and cough, followed by concepts related to felt, congestion, cleared, annoying, and to some extent weeks. The mirror image shows that moderate severity, embarrassing, times, and felt fine occupy a more peripheral position within the theme. By juxtaposing core and peripheral concepts, I have a more sophisticated manner for describing abstract themes.

The techniques described above are useful for displaying the prototypicality of words in context, but for large corpora of text such techniques are not very practical for describing general patterns. In such cases, an even more concentrated format is needed. I combined basic techniques from content analysis to list the words that coders agree belong to a given theme. Figure 4 shows the most prototypical words associated with signs and symptoms. The first column of words are those that all ten coders agreed belonged to the S/S theme. These words are ranked according to how often each word appeared in the text. For example, the word throat occurred 14 times in the text, and in at least one occurrence, all ten coders agreed the word pertained to the S/S theme. Unlike classic content analysis that associates high frequency words with theme salience, this technique identifies words that are pertinent to a theme but may have low frequencies. For example, coders always associated shaking and sweats with signs and symptoms, even though both words only occurred once in the illness descriptions.
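A listing of this kind can be approximated with the following Python sketch. It assumes that each word form is placed in the column for the highest agreement it reached in any occurrence (as the throat example suggests), that tokens have already been stripped of punctuation, and that ties in frequency are broken alphabetically. The function name and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def words_by_agreement(words, counts, common_words=frozenset()):
    """Group word forms by their highest intercoder count, ranked by text frequency.

    words  : list of tokens in text order
    counts : number of coders who marked each token (same order)
    """
    text_freq = Counter(w.lower() for w in words)
    best = {}
    for word, n in zip(words, counts):
        word = word.lower()
        if word not in common_words:
            best[word] = max(best.get(word, 0), n)
    columns = defaultdict(list)
    for word, level in best.items():
        columns[level].append((text_freq[word], word))
    # Highest frequency first, then alphabetical.
    return {level: sorted(items, key=lambda fw: (-fw[0], fw[1]))
            for level, items in columns.items()}


# Toy example with made-up intercoder counts.
words = ["sore", "throat", "sore", "throat", "days", "sweats"]
counts = [10, 10, 9, 9, 8, 10]
table = words_by_agreement(words, counts)
for level in sorted(table, reverse=True):
    print(level, table[level])
```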

Figure 4. Word frequency ranked by intercoder agreement (common words eliminated).

Intercoder Agreements

10                    9                     8                     7                     6
Freq. Word            Freq. Word            Freq. Word            Freq. Word            Freq. Word
   24 feel(ing, felt)    14 last(ed, ing)      33 day(s)             17 work                8 because
   14 throat              8 not                 6 slept              10 home                7 ago
   12 sore                7 developed           3 lot                 9 so                  6 flu
   11 cough(ing)          4 mild                2 second              9 weeks               6 next
   11 fever(s, ish)       4 started             1 associated          5 symptoms            4 just
    6 congest(ed,ion)     4 well                1 clogged             3 before              3 didn’t
    6 tired               3 bedridden(bed)      1 extremely           3 began               3 during
    4 achy(es,ing)        3 degrees             1 immediately         3 several             3 few
    4 nasal               3 did                 1 nap                 2 end                 3 morning
    3 chills              3 usual               1 needed              2 friday              2 got
    3 nose                2 clear(ed)                                 2 night               1 typical
    3 over                2 horrible                                  2 really
    2 fatigue(d)          2 pretty                                    2 staying
    2 headache(s)         2 severe                                    2 still
    2 level               1 alternately                               2 third
    2 low(er)             1 annoying                                  2 times
    2 malaise             1 appeared                                  1 concentration
    2 mydrias(grams)      1 body                                      1 difficulty
    2 productive          1 light                                     1 evening
    2 rhinorrhea          1 lymphadenopatry                           1 experienced
    2 sneezing            1 minimal                                   1 followed
    2 vomiting            1 muscles                                   1 high
    2 whole                                                           1 progressive
    1 discharge                                                       1 spent
    1 drip                                                            1 subsequently
    1 dry                                                             1 uncomfortable
    1 easily                                                          1 waning
    1 energy                                                          1 weak
    1 generalized
    1 grade
    1 headedness
    1 lethargic
    1 malcongestion
    1 mydrias
    1 nonproductive
    1 post
    1 runny
    1 shaking
    1 sweats
    1 than

The set of words on which all ten coders agreed tends to be related to physiological indicators. These words occupy the core part of the sign/symptom construct. The second set has a number of words related to severity (e.g., mild, degrees, horrible, severe, annoying, minimal), suggesting that evaluations of illnesses might be an important subtheme. Sets further to the right indicate peripheral subthemes. For example, words related to time (e.g., days, ago, weeks, before, end, night, morning, followed, next), and behaviors (e.g., slept, nap, work, concentration) are found in the last three columns.

Measuring the Prototypicality of Quotes

The word analysis techniques described above are useful for depicting the range and central tendency of a theme. Researchers, however, typically want to use prototypical examples and quotes in their descriptions. Two problems arise: a) How does the investigator identify the most prototypical quotes?; and b) How can a critic or a reviewer be sure that the selected quotes or examples are indeed representative of the text being analyzed?

 

 

 

Figure 5. Sentences related to signs and symptoms sorted by weighted prototypicality scores.

                   Prototypicality
              Total         Weighted
ID   Raw   Score  Rank   Score  Rank   Sentence
11    10      40    36    10.0     1   Sore throat, rhinorrhea, mydrias.
10    10      44    34     9.8     2   Minimal sore throat, nasal discharge.
17    10      72    25     9.0     3   My nose was congested and I was sneezing.
22    10     116    10     8.9     4   I had a headache, post nasal drip, low grade fever and felt horrible.
13    10      79    22     8.8     5   I felt fatigued and experienced generalized malaise and myalgias.
 8    10      52    33     8.7     6   Subsequently congestion and mild cough appeared.
 4    10     137     6     8.6     7   Energy level felt lower than usual, more easily tired, needed to nap several times during day.
17    10      42    35     8.4     8   I felt lethargic and tired.
12    10     134     7     8.4     9   I started coughing and then developed some malcongestion: a sore throat as well as a fever.
 2    10     107    13     8.2    10   I was uncomfortable with fevers associated with myalgrams, headaches, and a sore throat.
19    10     171     4     8.1    11   On Friday night I developed a fever of 103 degrees, and spent that night in bed alternately with chills and sweats.
 6    10     341     1     8.1    12   It began for me with a mild sore throat and a mild fever but by the second day my fever was up to 101 degrees F. by the third day up to 103 degrees F. and my sore throat was pretty severe.
 3    10      32    40     8.0    13   I was very congested.
21    10     104    15     8.0    14   I was congested, achy and feverish with the symptoms lasting about 4-5 days.
 9    10     198     2     7.9    15   Five weeks ago I developed sore throat that lasted two days followed by about ten days of nasal congestion, clear rhinorrhea and mild nonproductive cough.

Drawing from the previous analysis of words, I calculated three measures of sentence prototypicality. Figure 5 shows the sentences that are the most representative of signs and symptoms for the ten coders. I calculated a raw agreement score by counting the number of coders that had marked at least one word in a sentence as pertaining to the S/S theme. The raw agreement score consists of integers from zero to ten and is simple to explain. From a practical perspective, however, this score was not very helpful for my purposes as it produced a lot of ties and tended to give higher scores to longer sentences.

Next I calculated a total agreement score by counting the number of coders that had marked each word and then summing the counts across all the words in a sentence. The minimum score was zero and the maximum score was the number of coders (in this case ten) multiplied by the number of words in the longest sentence (in this case 42). The range of score values was much higher than the raw agreement scores, thus allowing for finer distinctions between sentences. Although longer sentences still had a better chance of scoring higher than did shorter sentences, high scores identified sentences containing the most signs and symptoms.

Finally I calculated a weighted agreement score by taking the total agreement score for a sentence and dividing it by the number of words in the sentence. Scores ranged from zero to ten (the total number of coders) and can be considered a measure of a sentence's potency with regard to a theme. Those sentences that score high tend to be extremely pithy with regard to the S/S theme. For example, the first sentence in Figure 5, "Sore throat, rhinorrhea, mydrias" has a weighted agreement score of ten because all ten coders marked all four words as pertaining to the S/S theme.
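The three scores can be written down compactly. The Python sketch below assumes a sentence is represented as a list of words, each holding one 0/1 mark per coder; the function name sentence_scores and this data layout are hypothetical. The first example reproduces the top row of Figure 5, where all ten coders marked all four words.

```python
def sentence_scores(word_marks):
    """Return (raw, total, weighted) agreement scores for one sentence.

    word_marks: one list per word, holding each coder's 0/1 mark for the theme.
    """
    n_coders = len(word_marks[0])
    # Raw: coders who marked at least one word in the sentence.
    raw = sum(1 for c in range(n_coders)
              if any(marks[c] for marks in word_marks))
    # Total: marks summed over every word and coder.
    total = sum(sum(marks) for marks in word_marks)
    # Weighted: total divided by sentence length in words.
    weighted = total / len(word_marks)
    return raw, total, weighted


# "Sore throat, rhinorrhea, mydrias." marked in full by all ten coders.
top_sentence = [[1] * 10 for _ in range(4)]
print(sentence_scores(top_sentence))

# A made-up two-word sentence seen by three coders.
print(sentence_scores([[1, 0, 0], [1, 1, 0]]))
```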

By ranking the sentences according to total agreement scores, I can identify those sentences that coders not only agreed pertained to the theme, but also contained the most prototypical words. By ranking the sentences according to the weighted agreement scores, I identify prototypical but pithy sentences that pertain to the theme. In general, the two types of scores are quite similar. In fact, I find that the scores on all 107 sentences are correlated at r = 0.77 (p< 0.001), and the correlation of the ranks is 0.85 (p<0.001).
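For readers who want to run a similar comparison on their own scores, here is a small pure-Python sketch of a Pearson correlation and a rank (Spearman-style) correlation. It is not the original calculation: the sample values are only the 15 sentences listed in Figure 5, not all 107, so the output will not match the coefficients reported above, and the use of average ranks for ties is an assumption of this sketch.

```python
import math


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """Ranks with 1 = largest value; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


# Total and weighted scores for the 15 sentences shown in Figure 5.
total = [40, 44, 72, 116, 79, 52, 137, 42, 134, 107, 171, 341, 32, 104, 198]
weighted = [10.0, 9.8, 9.0, 8.9, 8.8, 8.7, 8.6, 8.4, 8.4, 8.2, 8.1, 8.1, 8.0, 8.0, 7.9]
r_scores = pearson(total, weighted)
r_ranks = pearson(ranks(total), ranks(weighted))
print(round(r_scores, 2), round(r_ranks, 2))
```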

Discussion

Intercoder agreement has a variety of functions in the analysis of text.

Intercoder-agreement-as-reliability: Agreement between coders tells investigators the degree to which coders can be treated as "reliable" measurement instruments (e.g., Carey et al. 1996). High degrees of intercoder reliability mean that multiple coders are applying the codes in the same manner. If one coder marks half of the data and another coder marks the other half, investigators need to know that both coders are performing more or less the same tasks. Normally the reliability test is done by having both coders independently code a sample of the entire text.

Intercoder reliability is particularly important if the coded data will be analyzed statistically. If coders disagree, then the coded data are inaccurate. Discrepancies between coders are considered error and affect analysis calculations. One of the advantages of using content analysis and word dictionaries (Stone et al. 1966) is that such techniques are 100% reliable because they always mark exactly the same text. Of course, content analysis cannot judge the more subtle meaning of statements as human coders can.

Intercoder-agreement-as-reliability also is important for text retrieval tasks. After coding, researchers usually want to search through their texts and find examples of a particular code. An investigator who uses a single coder to mark themes relies on the coder's ability not to miss examples. Having multiple coders mark a text increases the likelihood of finding all the examples in a text that pertain to a given theme.

Intercoder-agreement-as-validity: Mitchell (1979) noted that most qualitative analyses use intercoder agreement to measure construct validity rather than measurement reliability.

Demonstrating that multiple coders can pick the same text as pertaining to a theme shows that the theme is not just a figment of the primary investigator's imagination. Validity could be further increased if informants (rather than investigators) acted as coders.(1)



Intercoder-agreement-as-construct-definition: Theme identification and definition is part of the inductive research process. It begins when investigators try to define the themes that they find emerging from their texts. After reading over the corpus, team members discuss what constructs or themes they want to examine, and what kind of things "count as" a particular construct. This is the process of building a codebook or theme list.

As Lyn Richards, one of the authors of the text management software NUD*IST, said in a note to a qualitative list server: "the central qualitative process is usually seeing and discussing diverging interpretations of codes. I print summary reports, pin them on the wall and talk!" When developing or refining codebook definitions, the differences among researchers' interpretations are not measured systematically, nor should they be. The idea for the investigative unit is to come to some agreement as to what "counts as the construct." In most cases, the constructs are "fuzzy" and require prototypical examples rather than strict logical definitions.

In this article, I have suggested two additional ways to use intercoder agreement. First, intercoder agreement measures can be used as a tool to systematically describe the range, central tendency, and distribution of responses within a theme. They are particularly useful for identifying gradations of core and periphery structures within abstract constructs. Second, intercoder agreement is a measurement device for identifying and ranking prototypical quotes from informants. Critics and reviewers often want to know to what degree the quotes and examples used by investigators are indeed representative of informants' texts. Intercoder agreement measurements provide an answer.

Addendum: How many coders is enough?

By far the most common response to reviews of this article has been, "But I can't afford to use ten coders. How many is enough?" It is a fair, but tough question. Below I give some general thoughts on the matter. The answer seems to depend on: a) the ability of the coder to identify themes, b) the core/periphery dispersion of the theme, c) the number of times that any given theme appears in the text, and d) the levels of specificity investigators wish to achieve.

The last two constraints are similar to the sampling problems Bernard and Killworth (1993) solved for time allocation research. They showed that the rarer an event's occurrence in a population, the more you have to sample to assure you find it with any confidence. They also showed that sample size depends on whether you want to be sure of identifying at least one occurrence of an event, or you want to know the frequency of an event's occurrence in the population within a particular confidence interval.

In the case of text analysis (unlike time allocation), the population is known -- it is the entire corpus of text that has been collected. The unknowns are the rate of a theme's occurrence and each coder's ability to identify the theme when it occurs. It stands to reason that if a theme occurs a lot, a single coder is likely to find at least one example of the theme, even if the coder is not very good at identifying themes. If the theme occurs rarely, however, the likelihood of finding a single example decreases. It decreases even more if the coder isn't very good. Unfortunately, investigators usually are willing to miss a few examples of a theme that occurs a lot but can't afford to miss any examples of a theme that occurs rarely. It makes sense, therefore, that the rarer a theme's occurrence and the more important it is to find all occurrences, the more coders you want to look for it.
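This reasoning can be made concrete with a deliberately simple probability sketch. It rests on assumptions that are not in the text: every coder detects any given occurrence of a theme with the same probability p_detect, and detections are independent across coders and occurrences. Real coders will violate both, so the numbers are illustrative only, and the function names are hypothetical.

```python
def p_occurrence_found(p_detect, n_coders):
    """Chance that at least one of n_coders marks a given occurrence."""
    return 1 - (1 - p_detect) ** n_coders


def p_theme_found(p_detect, n_occurrences, n_coders=1):
    """Chance that at least one occurrence is marked by at least one coder."""
    return 1 - (1 - p_detect) ** (n_occurrences * n_coders)


def p_all_found(p_detect, n_occurrences, n_coders):
    """Chance that every occurrence is marked by at least one coder."""
    return p_occurrence_found(p_detect, n_coders) ** n_occurrences


# A mediocre coder (p_detect = 0.5) facing a common theme and a rare one.
print(p_theme_found(0.5, n_occurrences=10))             # common theme, one coder
print(p_theme_found(0.5, n_occurrences=1))              # rare theme, one coder
print(p_all_found(0.5, n_occurrences=2, n_coders=2))    # find both rare occurrences
print(p_all_found(0.5, n_occurrences=2, n_coders=5))    # same task, five coders
```

Under these assumptions, adding coders matters most when the theme is rare and when every occurrence must be found, which is the pattern described in the paragraph above.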



The number of coders needed to identify aspects of core/periphery structures in constructs depends on the level of distinction an investigator wants to make. I see themes and abstract constructs as targets made up of concentric circles. The more coders you add, the more circles you have in the target. With a single coder, you cannot distinguish between core and periphery features of a theme. Figure 2 shows what can be learned about core/periphery features with just two coders, and Figure 4 shows the kinds of distinctions that can be made with ten coders.

In hindsight, I can use the multiple coders to calculate the probability of any one coder associating a single word with a particular theme. Figure 4 shows that any single coder would probably have associated any of the words in the first column with the S/S theme. The single coder would have had a 90% probability of identifying those words in the second column, an 80% probability for those in the third column, and so forth. It becomes apparent that investigators interested in confidently identifying and describing the peripheral aspects of a theme will probably want to use multiple coders.

Likewise, since all the quotes shown in Figure 5 were marked by all the coders, I can assume that any single coder would have found them. Of course, a single coder would have also identified quotes that were less prototypical. With a single coder, however, there would be no way to separate less prototypical quotes from the more prototypical quotes. Increasing the number of coders will not help find more core quotes, but it will allow investigators to distinguish among quotes in a replicable manner. It also seems reasonable to assume that the less well defined a construct, the more coders are needed to describe it in detail.

I find it helpful to recognize the limitations and advantages of single and multiple-coder research. It seems plausible that for some tasks investigators can rely on a single coder and for other tasks they should use multiple coders. Ultimately, it is the investigator's responsibility to identify the goals of the research and determine what kind of coding is required.

Acknowledgments

I would like to thank Kathleen MacQueen for steering us toward the problem of identifying and measuring prototypicality and Russell Bernard for his insights into potential solutions to this problem. I would also like to thank the Clinical Scholars who collected and coded the texts.



References

Bernard, H. Russell and Peter D. Killworth

1993 Sampling in time allocation research. Ethnology 32:207-215.

Carey, James W., Mark Morgan and Margaret J. Oxtoby

1996 Intercoder agreement in analysis of responses to open-ended interview questions: Examples from tuberculosis research. Cultural Anthropology Methods Journal 8(3):1-5.

Krippendorff, Klaus

1980 Content analysis: An introduction to its methodology. Beverly Hills: Sage Publications.

Miles, Matthew B. and A. Michael Huberman

1994 Qualitative data analysis: an expanded sourcebook. 2nd ed. Thousand Oaks, CA: Sage Publications.

Mitchell, Sandra K.

1979 Interobserver agreement, reliability, and generalizability of data collected in observational studies. Psychological Bulletin 86:376-390.

Roberts, Carl W.

1997 A theoretical map for selecting among text analysis methods. In Text analysis for the social sciences: Methods for drawing statistical inferences from texts and transcripts. Carl W. Roberts, ed. Pp. 275-283. NJ: Lawrence Erlbaum Associates.

Ryan, Gery

1996 Fieldnote Searcher, 1.0. Los Angeles: Fieldwork & Qualitative Data Laboratory, UCLA.

Ryan, Gery and Thomas Weisner

1996 Analyzing words in brief descriptions: Fathers and mothers describe their children. Cultural Anthropology Methods Journal 8(3):13-16.

Stone, Philip J., M. S. Dunphy and D. M. Ogilvie

1966 The general inquirer: A computer approach to content analysis. Cambridge: MIT Press.

Truex, Gregory F.

1993 Tagging and typing: Notes on codes in anthropology. Cultural Anthropology Methods Journal 5(1):3-5.

Weber, Robert Philip

1990 Basic content analysis. 2nd ed. Newbury Park, CA: Sage Publications.

Endnotes

1. I would like to thank Roy D'Andrade for this suggestion.
