TOOLS > CLUSTERING > CLUSTER ADEQUACY

TOOLS > CLUSTER ANALYSIS > CLUSTER ADEQUACY

PURPOSE Calculates standard measures of fit for specified clusters for a proximity matrix.

DESCRIPTION Given a partition of a proximity matrix of similarities into clusters, this routine calculates goodness-of-fit measures that try to capture the adequacy of the partition. The measures calculated are eta, Newman and Girvan's modularity Q, Krackhardt and Stern's E-I index, Freeman's segregation measure S and Cohen's Kappa. The routine takes a proximity matrix and a partition matrix in which each column defines a partition into clusters.

The measures are defined as follows.

Eta is the correlation between the data matrix and an ideal structure matrix in which x(i,j) = 1 if i and j are in the same cluster and 0 otherwise.

Newman and Girvan's modularity Q is the fraction of edges that fall within the clusters of the partition minus the expected such fraction if the edges were distributed at random. Q has a maximum value of 1-1/m, where m is the number of clusters. Qprime is a normalized version of Q.

Krackhardt and Stern's E-I index is the number of external ties minus the number of internal ties, divided by the total number of ties.

Freeman's S is the expected number of edges between groups minus the number observed, divided by the expected number. It is set to zero if this is negative.

Cohen's Kappa is similar, in that it is the observed minus the expected, divided by the maximum minus the expected.

Note that for similarity data we expect all measures except E-I to be positive. For a good partition of similarity data, E-I should be close to minus one.
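The definitions of eta, Q and E-I above can be illustrated with a minimal Python sketch. It is not UCINET's implementation: the function names are hypothetical, the diagonal is ignored, and Q uses the standard degree-based form of Newman and Girvan's measure as the "expected fraction", which is an assumption about the null model. Freeman's S and Cohen's Kappa are left out because the description does not say how their expected and maximum values are computed.

```python
from math import sqrt


def off_diagonal_pairs(n):
    """All ordered pairs (i, j) with i != j; the diagonal is ignored (assumption)."""
    return [(i, j) for i in range(n) for j in range(n) if i != j]


def eta(matrix, labels):
    """Pearson correlation between the data and the ideal structure matrix,
    where ideal x(i,j) = 1 if i and j share a cluster and 0 otherwise."""
    pairs = off_diagonal_pairs(len(labels))
    xs = [matrix[i][j] for i, j in pairs]
    ys = [1 if labels[i] == labels[j] else 0 for i, j in pairs]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def modularity(matrix, labels):
    """Fraction of tie weight inside clusters minus the fraction expected
    when ties are placed at random with the same row sums (assumed null model)."""
    n = len(labels)
    degree = [sum(matrix[i][j] for j in range(n) if j != i) for i in range(n)]
    total = sum(degree)
    q = 0.0
    for cluster in set(labels):
        members = [i for i in range(n) if labels[i] == cluster]
        within = sum(matrix[i][j] for i in members for j in members if i != j)
        cluster_degree = sum(degree[i] for i in members)
        q += within / total - (cluster_degree / total) ** 2
    return q


def e_i_index(matrix, labels):
    """(external ties - internal ties) / total ties."""
    pairs = off_diagonal_pairs(len(labels))
    internal = sum(matrix[i][j] for i, j in pairs if labels[i] == labels[j])
    external = sum(matrix[i][j] for i, j in pairs if labels[i] != labels[j])
    return (external - internal) / (external + internal)


# Illustrative data: two triangles joined by a single tie between actors 2 and 3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
adjacency = [[0] * 6 for _ in range(6)]
for i, j in edges:
    adjacency[i][j] = adjacency[j][i] = 1
labels = [1, 1, 1, 2, 2, 2]

print(round(eta(adjacency, labels), 3))         # 0.873
print(round(modularity(adjacency, labels), 3))  # 0.357
print(round(e_i_index(adjacency, labels), 3))   # -0.714
```

For this partition eta and Q are positive and E-I is negative, which matches the pattern expected for similarity data. With two clusters the maximum of Q is 1-1/2 = 0.5.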

PARAMETERS
Input proximity dataset:
Name of file containing the proximity matrix on which cluster fits are to be measured. Data type: Square symmetric matrix.

Input partitions:
Name of a UCINET dataset (an attribute dataset) that defines the partitions in its columns. All entries in a given column with the same value will be put in the same cluster of that partition.
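As an illustration of how a partition column is read, the following Python sketch groups actors by the value they hold in a chosen column. The layout (one row per actor, one column per partition) follows the description above. The function name and the sample data are hypothetical and are not part of UCINET.

```python
from collections import defaultdict


def clusters_from_column(partition_matrix, column):
    """Group actor indices by their value in one column of the partition matrix."""
    groups = defaultdict(list)
    for actor, row in enumerate(partition_matrix):
        groups[row[column]].append(actor)
    return dict(groups)


# Hypothetical attribute data: six actors (rows) and two partitions (columns).
partitions = [
    [1, 1],
    [1, 1],
    [1, 2],
    [2, 2],
    [2, 3],
    [2, 3],
]

print(clusters_from_column(partitions, 0))  # {1: [0, 1, 2], 2: [3, 4, 5]}
print(clusters_from_column(partitions, 1))  # {1: [0, 1], 2: [2, 3], 3: [4, 5]}
```

The first column gives a two-cluster partition and the second a three-cluster partition of the same six actors. Each column would be evaluated separately.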

Output Block model: (Default = Moca).
Name of output file which contains the values of the various measures corresponding to the partitions in the columns of the partition matrix.

LOG FILE For each partition, a table of the frequency and proportion of actors in each cluster. This is followed by a table with the measures defined in the description in the rows and the different partitions in the columns.
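A frequency table of this kind can be sketched in Python as follows. The function name and the sample labels are hypothetical, and the exact layout of the UCINET log is not reproduced.

```python
from collections import Counter


def cluster_frequencies(labels):
    """Return {cluster: (frequency, proportion of actors)} for one partition."""
    counts = Counter(labels)
    n = len(labels)
    return {cluster: (count, count / n) for cluster, count in sorted(counts.items())}


# Hypothetical partition of eight actors into three clusters.
labels = [1, 1, 1, 1, 2, 2, 3, 3]
for cluster, (frequency, proportion) in cluster_frequencies(labels).items():
    print(cluster, frequency, proportion)
```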

TIMING O(N^2).

COMMENTS There is considerable debate about which measures should be used, and no firm conclusions have been reached. Many other variants are possible.

REFERENCES