MB 813
Multivariate Methods
Carroll School of Management, Boston College

 

Measures of Similarity
Steve Borgatti, Boston College

 

Assume that we are measuring the similarity between vector X and vector Y. We use X* and Y* to refer to the canonical normalizations (or uniformed versions) of X and Y.

Generic Measure of Similarity

  • If X* indicates the uniformed version of X, then the Zegers & ten Berge family of association measures can all be described by the same equation:

  • (s +1)/2
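The shared equation can be sketched in code. The sketch below assumes the general form usually given for this family, g = 2*sum(X*Y*) / (sum(X*^2) + sum(Y*^2)), and assumes the uniforming transformation for each scale type is: nothing (absolute), centering (additive), division by the root mean square (ratio), and standardizing (interval). The function names `uniform` and `association` are illustrative only.

```python
import math

def uniform(v, scale):
    """Return an assumed 'uniformed' version of v for the given scale type."""
    n = len(v)
    mean = sum(v) / n
    if scale == "absolute":   # no admissible transformation to remove
        return list(v)
    if scale == "additive":   # remove the additive constant (center)
        return [x - mean for x in v]
    if scale == "ratio":      # remove the multiplicative constant
        rms = math.sqrt(sum(x * x for x in v) / n)
        return [x / rms for x in v]
    if scale == "interval":   # remove both constants (standardize)
        sd = math.sqrt(sum((x - mean) ** 2 for x in v) / n)
        return [(x - mean) / sd for x in v]
    raise ValueError("unknown scale type: " + scale)

def association(x, y, scale):
    """Assumed general form: 2*sum(X*Y*) / (sum(X*^2) + sum(Y*^2))."""
    ux, uy = uniform(x, scale), uniform(y, scale)
    num = 2 * sum(a * b for a, b in zip(ux, uy))
    den = sum(a * a for a in ux) + sum(b * b for b in uy)
    return num / den

x = [1, 2, 3, 4]
y = [2, 4, 6, 8]              # y is exactly 2 * x
for scale in ("absolute", "additive", "ratio", "interval"):
    print(scale, association(x, y, scale))
```

Because y is proportional to x, the ratio and interval versions come out at 1 (up to rounding), while the absolute and additive versions are penalized for the difference in scale.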

Absolute Scale Data

  • Identity coefficient. Scale differences not normalized away

  • Not mentioned by Zegers & ten Berge is the Euclidean distance coefficient. This measure is not normed; it varies from 0 to infinity
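A minimal sketch of the two absolute-scale measures follows. It assumes the identity coefficient has the form 2*sum(XY) / (sum(X^2) + sum(Y^2)); under that assumption it equals 1 minus the squared Euclidean distance divided by sum(X^2) + sum(Y^2), which shows how the unbounded distance relates to the normed coefficient. Function names are illustrative.

```python
import math

def euclidean_distance(x, y):
    """Unnormed dissimilarity: 0 for identical vectors, no upper bound."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def identity_coefficient(x, y):
    """Assumed form of the identity coefficient: 2*sum(xy) / (sum(x^2) + sum(y^2))."""
    num = 2 * sum(a * b for a, b in zip(x, y))
    den = sum(a * a for a in x) + sum(b * b for b in y)
    return num / den

x = [1, 2, 3]
y = [2, 4, 6]
d = euclidean_distance(x, y)
e = identity_coefficient(x, y)
total_ss = sum(a * a for a in x) + sum(b * b for b in y)
# The two quantities printed here agree (up to rounding)
print(e, 1 - d ** 2 / total_ss)
```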

Ratio Scale Data

  • Tucker's congruence = coefficient of proportionality. Differences in amplitude normalized away

Additive Scale Data

  • Coefficient of additivity = Winer's I

Interval Scale Data

  • Pearson correlation = coefficient of linearity

Ordinal data

  • Spearman's rho = r(X*,Y*)
  • Goodman and Kruskal Gamma = (P - Q)/(P + Q), where P is the number of concordant pairs and Q is the number of discordant pairs
  • example:
Case  X  Y
1     1  1
2     1  2
3     2  1
4     2  1
5     3  1
6     3  1
7     3  2

 

Classify each pair of cases as concordant (p), discordant (q), or neither (n, tied on X or on Y):

     1  2  3  4  5  6  7
1       n  n  n  n  n  p
2          q  q  q  q  n
3             n  n  n  p
4                n  n  p
5                   n  n
6                      n
7

P = 3, Q = 4, gamma = -1/7
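The pair-by-pair count can be checked with a short sketch. The function name `gamma_from_pairs` is illustrative; pairs tied on X or on Y are simply skipped, as in the matrix of p, q and n.

```python
from itertools import combinations

def gamma_from_pairs(x, y):
    """Count concordant (P) and discordant (Q) pairs; return (P, Q, gamma)."""
    p = q = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:      # same ordering on X and Y: concordant
            p += 1
        elif sign < 0:    # opposite ordering: discordant
            q += 1
        # sign == 0 means a tie on X or Y: counted as neither
    # Gamma is undefined when every pair is tied (P + Q = 0)
    return p, q, (p - q) / (p + q)

x = [1, 1, 2, 2, 3, 3, 3]
y = [1, 2, 1, 1, 1, 1, 2]
print(gamma_from_pairs(x, y))   # P = 3, Q = 4, gamma = -1/7
```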

Or do it via contingency table:

       Y=1  Y=2
X=1     1    1
X=2     2    0
X=3     2    1

P = 1*(0+1) + 2*(1) = 3

Q = 1*(2+2) + 0*(2) = 4

Gamma = -1/7

Another example:

City Size/Arenas   Small    Medium   Large
Weak Mayor         a = 10   b = 5    c = 2
Strong Mayor       d = 10   e = 15   f = 20

P = a(e+f) + bf = 10(15+20) + 5*20 = 450
Q = c(d+e) + bd = 2(10+15) + 5*10 = 100
gamma = (P - Q)/(P + Q) = (450-100)/(450 + 100) = .636
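The contingency-table calculation generalizes to any table size: each cell is multiplied by the cells below and to the right (for P) and by the cells below and to the left (for Q). A minimal sketch, with the illustrative name `gamma_from_table`, applied to both tables above:

```python
def gamma_from_table(table):
    """Return (P, Q, gamma) for a contingency table given as a list of rows."""
    rows, cols = len(table), len(table[0])
    p = q = 0
    for i in range(rows):
        for j in range(cols):
            for k in range(i + 1, rows):      # only rows below row i
                for m in range(cols):
                    if m > j:                 # below and to the right: concordant
                        p += table[i][j] * table[k][m]
                    elif m < j:               # below and to the left: discordant
                        q += table[i][j] * table[k][m]
    # Gamma is undefined when P + Q = 0
    return p, q, (p - q) / (p + q)

small_example = [[1, 1], [2, 0], [2, 1]]
mayor_example = [[10, 5, 2], [10, 15, 20]]
print(gamma_from_table(small_example))   # P = 3, Q = 4, gamma = -1/7
print(gamma_from_table(mayor_example))   # P = 450, Q = 100, gamma about .636
```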

Presence/Absence Data

  • Simple matches
  • Jaccard
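  • Gamma / Yule's Q
    • (ad-bc)/(ad+bc), where a, b, c, d are the cells of the 2 by 2 table (a and d on the main diagonal)
    • (OR-1)/(OR+1), where OR = ad/bc is the odds ratio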
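The presence/absence measures can be sketched from the four cells of the 2 by 2 table. The sketch assumes the usual labeling: a = both present, b = present in X only, c = present in Y only, d = both absent. The data vectors are made up for illustration, and the function names are not from any library.

```python
def two_by_two(x, y):
    """Cross-tabulate two 0/1 vectors into cells (a, b, c, d)."""
    a = sum(1 for u, v in zip(x, y) if u and v)           # both present
    b = sum(1 for u, v in zip(x, y) if u and not v)       # X only
    c = sum(1 for u, v in zip(x, y) if not u and v)       # Y only
    d = sum(1 for u, v in zip(x, y) if not u and not v)   # both absent
    return a, b, c, d

def simple_matching(a, b, c, d):
    return (a + d) / (a + b + c + d)

def jaccard(a, b, c, d):
    return a / (a + b + c)       # joint absences (d) are ignored

def yules_q(a, b, c, d):
    return (a * d - b * c) / (a * d + b * c)

x = [1, 1, 1, 0, 0, 0, 1, 0]     # illustrative data only
y = [1, 1, 0, 0, 0, 1, 1, 0]
cells = two_by_two(x, y)         # (3, 1, 1, 3)
odds_ratio = (cells[0] * cells[3]) / (cells[1] * cells[2])
print(simple_matching(*cells), jaccard(*cells), yules_q(*cells))
print((odds_ratio - 1) / (odds_ratio + 1))   # same value as Yule's Q
```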

Nominal Data

  • chi-square
  • Cramer's V (equals phi when the table is 2 by 2)
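A minimal sketch of both nominal measures, assuming the standard Pearson chi-square and V = sqrt(chi-square / (n * (k - 1))), where k is the smaller of the number of rows and columns. The 2 by 2 table below is made up for illustration; in that case V matches the absolute value of phi.

```python
import math

def chi_square(table):
    """Pearson chi-square for a contingency table given as a list of rows."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    n = sum(row_tot)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n   # expected count under independence
            chi2 += (observed - expected) ** 2 / expected
    return chi2

def cramers_v(table):
    """Cramer's V = sqrt(chi2 / (n * (k - 1))), k = min(rows, columns)."""
    n = sum(sum(r) for r in table)
    k = min(len(table), len(table[0]))
    return math.sqrt(chi_square(table) / (n * (k - 1)))

t = [[10, 20], [30, 40]]         # illustrative 2 by 2 table
a, b = t[0]
c, d = t[1]
phi = (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
print(cramers_v(t), abs(phi))    # the two values agree (up to rounding)
```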