MB 813 Multivariate Methods Carroll School of Management, Boston College

Measures of Similarity
Steve Borgatti, Boston College

Assume that we are measuring the similarity between vector X and vector Y. We use X* and Y* to refer to the canonical normalizations (or uniformed versions) of X and Y.

Generic Measure of Similarity

• If X* indicates the uniformed version of X, then the Zegers & ten Berge family of association measures can all be described by the same equation:

• (s +1)/2

Absolute Scale Data

• Identity coefficient. Scale differences not normalized away

• Not mentioned by Zegers & ten Berge is the Euclidean distance coefficient. This measure is not normed: it varies from 0 to ∞

Ratio Scale Data

• Tucker's congruence = coefficient of proportionality. Differences in amplitude normalized away

Additive Scale Data

• Coefficient of additivity = Winer's I

Interval Scale Data

• Pearson correlation = coefficient of linearity
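
The four coefficients above can be read as a single formula applied to differently uniformed vectors. The sketch below is an illustration under stated assumptions: it takes the generic form usually attributed to Zegers & ten Berge, g = 2ΣX*Y* / (ΣX*² + ΣY*²), and the usual uniforming rules (nothing removed for absolute data, division by the root mean square for ratio data, subtraction of the mean for additive data, standardization for interval data). Neither that formula nor the function names `generic_g`, `uniform` and `similarity` appear in these notes.

```python
import math

def generic_g(xs, ys):
    """Assumed generic coefficient: 2*sum(x*y) / (sum(x^2) + sum(y^2))."""
    num = 2 * sum(x * y for x, y in zip(xs, ys))
    den = sum(x * x for x in xs) + sum(y * y for y in ys)
    return num / den

def uniform(v, scale):
    """Return the (assumed) uniformed version of v for a given scale type."""
    n = len(v)
    mean = sum(v) / n
    if scale == "absolute":      # nothing is normalized away
        return list(v)
    if scale == "ratio":         # remove amplitude: divide by root mean square
        rms = math.sqrt(sum(x * x for x in v) / n)
        return [x / rms for x in v]
    if scale == "additive":      # remove level: subtract the mean
        return [x - mean for x in v]
    if scale == "interval":      # remove level and amplitude: standardize
        sd = math.sqrt(sum((x - mean) ** 2 for x in v) / n)
        return [(x - mean) / sd for x in v]
    raise ValueError(f"unknown scale type: {scale}")

def similarity(xs, ys, scale):
    """Generic coefficient applied to the uniformed versions of xs and ys."""
    return generic_g(uniform(xs, scale), uniform(ys, scale))

# Hypothetical data: Y = 2X + 1, an exact linear function of X
X = [1.0, 2.0, 3.0, 4.0]
Y = [3.0, 5.0, 7.0, 9.0]
for scale in ("absolute", "ratio", "additive", "interval"):
    print(scale, round(similarity(X, Y, scale), 3))
```

Because Y is an exact linear function of X, only the interval-scale version (the Pearson correlation) reaches 1. The other three fall short because they still count the difference in level, in amplitude, or in both.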

Ordinal Data

• Spearman's rho = r(X*,Y*)
• Goodman and Kruskal's gamma = (P - Q)/(P + Q), where P is the number of concordant pairs and Q is the number of discordant pairs
• example:
 Case  X  Y
 1     1  1
 2     1  2
 3     2  1
 4     2  1
 5     3  1
 6     3  1
 7     3  2

 Classifying each pair of cases (p = concordant, q = discordant, n = tied on X or Y):

      1  2  3  4  5  6  7
 1       n  n  n  n  n  p
 2          q  q  q  q  n
 3             n  n  n  p
 4                n  n  p
 5                   n  n
 6                      n
 7

P = 3, Q = 4, gamma = -1/7
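
The pair-by-pair count above can be sketched in a few lines of Python. This is a minimal illustration, not a library routine: the function name `gamma_from_pairs` is made up here, and it does not guard against the case P + Q = 0.

```python
from itertools import combinations

def gamma_from_pairs(xs, ys):
    """Goodman and Kruskal's gamma by classifying every pair of cases."""
    p = q = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        direction = (x1 - x2) * (y1 - y2)
        if direction > 0:      # same order on X and Y: concordant
            p += 1
        elif direction < 0:    # opposite order on X and Y: discordant
            q += 1
        # direction == 0 means a tie on X or on Y; tied pairs are ignored
    return p, q, (p - q) / (p + q)

# The seven cases from the example
X = [1, 1, 2, 2, 3, 3, 3]
Y = [1, 2, 1, 1, 1, 1, 2]
P, Q, gamma = gamma_from_pairs(X, Y)
print(P, Q, gamma)
```

Run on the seven cases, this reproduces P = 3, Q = 4 and gamma = -1/7.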

Or do it via contingency table:

        Y=1  Y=2
 X=1     1    1
 X=2     2    0
 X=3     2    1

P = 1*(0+1) + 2*(1) = 3

Q = 1*(2+2) + 0*(2) = 4

Gamma = -1/7

Another example:

 City Size/Arenas   Small     Medium    Large
 Weak Mayor         a = 10    b = 5     c = 2
 Strong Mayor       d = 10    e = 15    f = 20

P = a(e+f) + bf = 10(15+20) + 5*20 = 450
Q = c(d+e) + bd = 2(10+15) + 5*10 = 100
gamma = (P - Q)/(P + Q) = (450-100)/(450 + 100) = .636
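
The contingency-table shortcut used in both examples generalizes to any ordered table: each cell is multiplied by the cells below and to its right (concordant) and by the cells below and to its left (discordant). The following is a sketch under that reading; the function name `gamma_from_table` is hypothetical, and it assumes rows and columns are listed in increasing category order.

```python
def gamma_from_table(table):
    """Goodman and Kruskal's gamma from an ordered contingency table.

    table[i][j] is the count in row i, column j; rows and columns are
    assumed to be listed in increasing order of their categories.
    """
    n_rows, n_cols = len(table), len(table[0])
    p = q = 0
    for i in range(n_rows):
        for j in range(n_cols):
            for k in range(i + 1, n_rows):   # every row below the current cell
                # cells below and to the right form concordant pairs
                p += table[i][j] * sum(table[k][j + 1:])
                # cells below and to the left form discordant pairs
                q += table[i][j] * sum(table[k][:j])
    return p, q, (p - q) / (p + q)

small = [[1, 1], [2, 0], [2, 1]]        # first example: rows X, columns Y
mayors = [[10, 5, 2], [10, 15, 20]]     # second example: rows mayor type, columns city size
print(gamma_from_table(small))
print(gamma_from_table(mayors))
```

For the first table this gives P = 3 and Q = 4, and for the mayor table P = 450 and Q = 100, matching the hand calculations.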

Presence/Absence Data

• Simple matches
• Jaccard
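• Gamma / Yule's Q
• Q = (ad - bc)/(ad + bc), where a, b, c, d are the cells of the 2 by 2 table
• Equivalently, Q = (OR - 1)/(OR + 1), where OR = ad/bc is the odds ratio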
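
The presence/absence measures listed above can all be computed from the four cells of a 2 by 2 table. The sketch below assumes the conventional labeling (a = both present, b = present in X only, c = present in Y only, d = both absent) and the standard definitions of simple matching, (a + d)/(a + b + c + d), and Jaccard, a/(a + b + c). These definitions, the function names and the data are supplied here for illustration and are not spelled out in the notes.

```python
def two_by_two(xs, ys):
    """Count the cells of a 2x2 table from two 0/1 vectors.

    Assumed labeling: a = both present, b = X only, c = Y only, d = both absent.
    """
    pairs = list(zip(xs, ys))
    a = pairs.count((1, 1))
    b = pairs.count((1, 0))
    c = pairs.count((0, 1))
    d = pairs.count((0, 0))
    return a, b, c, d

def simple_matching(a, b, c, d):
    """Share of cases on which X and Y agree (joint absences count)."""
    return (a + d) / (a + b + c + d)

def jaccard(a, b, c, d):
    """Share of joint presences among cases present in X or Y (d is ignored)."""
    return a / (a + b + c)

def yules_q(a, b, c, d):
    """Yule's Q, which is gamma for a 2x2 table."""
    return (a * d - b * c) / (a * d + b * c)

# Hypothetical presence/absence data for ten cases
X = [1, 1, 1, 0, 0, 1, 0, 0, 1, 0]
Y = [1, 1, 0, 0, 0, 1, 1, 0, 1, 0]
cells = two_by_two(X, Y)
odds_ratio = (cells[0] * cells[3]) / (cells[1] * cells[2])
print(cells)
print(simple_matching(*cells), jaccard(*cells), yules_q(*cells))
print((odds_ratio - 1) / (odds_ratio + 1))   # same value as yules_q
```

The last line checks the odds-ratio form of Q against the cell form. The odds ratio is undefined when b or c is zero, so the cell form is the safer one to compute.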

Nominal Data

• Chi-square
• Cramér's V (equals phi when the table is 2 by 2)
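
To make the relation between these nominal measures concrete, here is a minimal sketch that assumes the usual definitions: the Pearson chi-square statistic, Cramér's V = sqrt(chi-square / (n * min(r - 1, c - 1))), and the unsigned phi = sqrt(chi-square / n). These formulas and the function names are not given in the notes, and the 2 by 2 counts are invented for illustration.

```python
import math

def chi_square(table):
    """Pearson chi-square statistic for a contingency table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2

def cramers_v(table):
    """Cramér's V: chi-square rescaled by n and the smaller table dimension."""
    n = sum(sum(row) for row in table)
    k = min(len(table), len(table[0])) - 1
    return math.sqrt(chi_square(table) / (n * k))

def phi(table):
    """Unsigned phi coefficient: sqrt(chi-square / n)."""
    n = sum(sum(row) for row in table)
    return math.sqrt(chi_square(table) / n)

two_by_two = [[10, 20], [30, 40]]       # hypothetical counts
mayors = [[10, 5, 2], [10, 15, 20]]     # the mayor table from the gamma example
print(cramers_v(two_by_two), phi(two_by_two))   # identical for a 2 by 2 table
print(cramers_v(mayors))
```

For a 2 by 2 table min(r - 1, c - 1) = 1, so the two formulas coincide. The phi computed this way is unsigned, since chi-square carries no direction.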