Contents - Index


PURPOSE Perform Johnson's hierarchical clustering on a proximity matrix.

DESCRIPTION Given a symmetric n-by-n representing similarities or dissimilarities among a set of n items, the algorithm finds a series of nested partitions of the items. The different partitions are ordered according to decreasing [increasing] levels of similarity [dissimilarity]. The algorithm begins with the identity partition (in which all items are in different clusters). It then joins the pair of items most similar (least different), which are then considered a single entity. The algorithm continues in this manner until all items have been joined into a single cluster (the complete partition).

Input dataset 
Name of file containing proximity matrix to be clustered. Data type: Square symmetric matrix.
Similarities or Distances? (Default = Similarities)
Whether items i and j should be clustered together when X(i,j) is large or when it is small. If data are Similarities, items i and j are clustered together if X(i,j) is very large. If data are Dissimilarities, items i and j are clustered together if X(i,j) is very small.
Output Partition matrix: (Default = 'Part')
Name of dataset to contain the partition-by-item indicator matrix. Each column of this matrix gives the cluster to which each item was assigned in a given partition. The columns are labeled by the level of the cluster.  A value of k in a column labeled x and row j means that actor j was in partition k at level x.  Actor k is always a member of partition k and is a representative label for the group.  It can be used by procedures like Transform>Block to obtain density matrices at any level of blocking.  This file is not displayed in the LOG FILE. 

Output Ultrametric matrix (if desired):
Name of dataset to contain the item-by-item ultrametric proximity matrix, if seleceted.

Method: (Default = WTD_AVERAGE)
Choices are:

Also known as the "minimum" or "connectedness" method. Distance between two clusters is defined as smallest dissimilarity (largest similarity) between members.

Also known as the "maximum" or "diameter" method. Distance between two clusters is defined as largest dissimilarity (smallest similarity) between members.

Distance between clusters is calculated as the average dissimilarity (or similarity) value weighted by cluster size.

Distance between clusters defined as average dissimilarity (or similarity) between members where the clusters are treated as single data points.

Graphical Dendrogram: (Default = Dendrogram)
The clustering can be shown as a dendrogram or a tree diagram.

Textual Dendrogram::(Default = Landscape)
Whether the textual output decsribed below is given in portrait or landscape format.

Maximum label length (Default = 15)
Number of label characters above which labels are truncated. 
Compute ultrametric proximity matrix? (Default = NO)
Hierarchical clustering can be seen as transforming a dissimilarity matrix into an ultrametric distance matrix. The ultrametric distances correspond monotonically to the number of iterations (partitions) needed to join a given pair of items.


LOG FILE Primary output are cluster diagrams. The first diagram (either a tree diagram or a dendrogram) re-orders the actors so that they are located close to other actors in similar clusters. The level at which any pair of actors are aggregated is the point at which both can be reached by tracing from the start to the actors from right to left. The scale at the top gives the level at which they are clustered. The diagram can be printed or saved. Parts of the diagram can be viewed by moving the mouse to the split point in a tree diagram or the beginning of a line in the dendrogram and clicking. The first click will highlight a portion of the diagram and the second click will display just the highlighted portion. To return to the original right click on the mouse. There is also a simple zoom facility simply change the values and then press enter. If the labels need to be edited (particularly the scale labels) then you should take the partition indicator matrix into the spreadsheet editor remove or reduce the labels and then submit the edited data to Tools>Dendrogram>Draw. The output also produces a standard Log file that contains a different cluster diagram the textual dendogram which looks like this:

        A B C D E F G H I J
Level 1 2 3 4 5 6 7 8 9 0
----- - - - - - - - - - -

In this example, the data were distances among 10 items, labeled A through J. The results are 4 nested partitions, corresponding to rows in the diagram. Within a given row, an 'X' between two adjacent columns indicates that the items associated with those columns were assigned to the same cluster in that partition. For example, in the first partition (level 1.000), items D and E belong to the same cluster, but C is a member of a different cluster. In the third partition (level 1.578), items D, E and C all belong to the same cluster. 

The levels indicate the degree of association (similarity or dissimilarity) among items within clusters. If, as in the example, the data are distances and the clustering method is single link, the a level of 1.578 means that every item within a cluster is no more than 1.578 units distant from at least one other item in that cluster. If the clustering method is complete link, a level of 1.578 indicates that every item in a cluster no more than 1.578 units distant from every other item in the cluster. For the average clustering method, a level of 1.578 indicates that the average distance among items within the cluster is 1.578.

For similarity data, the meaning of the levels for the single link and complete link methods is, in a sense reversed. For the single link method, a level of 1.578 means that every item in a cluster is at least 1.578 units similar to at least one other item in the cluster. For the complete link method, a level of 1.578 means that every item in a cluster is at least 1.578 units similar to every other item in the cluster.

A table of cluster adequacy with the rows giving the method and columns the clusters at various levels. The columns are ordered from the finest to the coarsest partition so that column 1 corresponds to the fist step in the clustering and the last column is the clustering before everything is placed in one cluster. 

A table giving the size of each cluster as a proportion of the total population. The rows are the clusters and the columns are the levels the i,j entry gives the proportion of members in cluster i at level j.



REFERENCES Johnson, S C (1967).  'Hierarchical clustering schemes'.  Psychometrika, 32, 241-253.