6.0 INTRODUCTION
Clustering is the art of grouping together pattern vectors that in some sense belong together because of similar characteristics. In the most general problem the number of clusters or groups are unknown as are the properties that make them similar. In this chapter various techniques will be explored that perform clustering of a given set of pattern vectors. In particular the K-means algorithm, Hierarchical clustering, ISODATA algorithm, fuzzy K means algorithm, and Adaptive Resonance Theory algorithm are explored. This list is not exhaustive but considered as representative of methods for clustering numerical pattern vectors.
6.1 MATHEMATICAL CONCEPT OF CLUSTERING
Suppose that a set of n-dimensional pattern vectors { xj , j=1,2, ... ,NS } is given, as shown in Figure 6-1. Clustering is the process of partitioning those pattern vectors into a number of groups or clusters , call them Clk, k= 1,2, ... , K such that none of them are empty, they are pair by pair disjoint (their intersections are the null set), and they are mutually exhaustive(their union is the entire set of all samples considered). The clusters are formed to reflect some unspecified sense of similarity. The elements in cluster Clk could represent vectors from a particular pattern Class Ck if desired.
6.2 DISTANCES BETWEEN CLUSTERS
In section 3.1 the distance between pattern vectors was presented. It was shown that to be a legitimate distance function, in the sense of the definition given, the function must satisfy the positivity, commutativity and triangular inequality conditions.
The most common and familiar distance function between vectors x and y is the Euclidean distance function defined on a given vector space as:

A few other frequently used distance functions between vectors are the maximum component distance and the sum of absolute component distances defined in section 3.7.
Many clustering techniques use the distance between clusters in their procedure, in particular the K-means, the Hierarchical clustering procedure and ISODATA clustering algorithms. As clusters are sets of pattern vectors an extension of the definition of the distance between vectors is required. There are many useful defined distance functions between clusters and the choice is usually based on known properties of the problem at hand and processing simplicity and limitations.
Since clusters are collections of vectors they can be described as subsets. Let Si and Sj be clusters, i.e. subsets of a given S such that Si and Sj are not , the null set, and Si and Sj are disjoint (Si intersect Sj=). Some common distance functions between subsets or clusters, based on subjective measures are the nearest neighbor distance, average distance, distance between means, distance between medians, and farthest neighbor distance. Each of these measures is defined as follows for Si and Sj being subsets of the total pattern space S.
(1) Nearest Neighbor


(3) Distance Between Means

If medi represents the vector of components that are the median of vector components for cluster Si.


6.3 K-MEANS CLUSTERING ALGORITHM
Define a performance measure J as follows

The details of the K-Means algorithm follow and represent alternating
minimization of J with respect to cluster centers and cluster assignments. First J is
minimized with the cluster centers fixed by varying the cluster assignment and
then J is minimized with cluster assignments fixed and cluster centers varied.
K-Means Clustering Algorithm
Step 1 Initialization
Choose K initial cluster centers, call them M1(1), M2(1), ... , MK(1). These K, n by 1 vectors could be chosen at random from all the pattern vectors given, or be the first K samples, or be samples based on some apriori information regarding physical clusters. Set the iteration index m equal to 1 and go to step 2.
Step 2 Determine New Clusters
Using the cluster centers Mk(m) at iteration m, distribute the pattern vectors by using the nearest neighbor rule. If the euclidean distance is used then pattern vector xj is assigned to cluster Clk(m) if

Step 3 Compute New Cluster Centers
Using the new clusters Clk(m), compute new cluster centers. The new cluster centers, Mk(m+1), k=1,2,...,K, can be determined as the averages of the patterns in each cluster as follows

Step 4 Check for Convergence
Using the cluster centers computed in the previous step, check to see if clustering is finished. Convergence will have occurred if none of the cluster centers are changed, that is

If Eq. (6-4) is not satisfied go to step 5.
Step 5 Check for Maximum Number of Iterations
As a safety valve, a maximum number of iterations called MAXIT can be assigned for stopping without convergence. If m>MAXIT then stop and announce that convergence was not obtained. If m<MAXIT then increment m by 1 and return to step 2.
The final cluster assignment is not unique as different initial conditions can give different clusters as shown in the following example.
EXAMPLE 6.1
Use the K-means algorithm with Euclidean distance to find clusters for the following set of pattern vectors shown in Figure 6- and given as

(a) using the first two samples, x1 and x2, as initial
cluster centers for the K-Means algorithm find a
separation of the pattern vectors into two clusters.
(b) using the first three samples, x1, x2, and x3 as initial cluster centers for the K-Means algorithm find a separation of the pattern vectors into three clusters.
Solution:
(a) The K-means algorithm step 1 is an initialization of the cluster centers and using the first two samples we have M1(1) and M2(1) as

| --- | x1 | x2 | x3 | x4 | x5 | x6 | x7 |
| d(xi,M1(1)) | 0 | 2 | 1.4142 | 4.2426 | 5 | 5 | 5.6569 |
| d(xi,M2(1)) | 2 | 0 | 1.4142 | 3.1623 | 3.6055 | 4.1231 | 4.4721 |
| Assignment | cl 1 | cl 2 | cl 1 | cl 2 | cl 2 | cl 2 | cl 2 |
Thus the new clusters Cl1(1) and CL2(1) become

The second iteration gives the new cluster assignment assigning the pattern to the cluster center that is closest. Calculating the distance from each sample to the cluster centers and assigning the pattern vector to the cluster centers gives the following assignment
| --- | x1 | x2 | x3 | x4 | x5 | x6 | x7 |
| d(xi,M1(1)) | 0.7071 | 1.5811 | 0.7071 | 3.5355 | 4.3012 | 4.3012 | 4.9497 |
| d(xi,M2(1)) | 4.2521 | 3.0463 | 2.8425 | 0.2828 | 0.8246 | 1.2166 | 1.4422 |
| Assignment | cl 1 | cl 1 | cl 1 | cl 2 | cl 2 | cl 2 | cl 2 |
Thus the new clusters Cl1(2) and CL2(2) become

| --- | x1 | x2 | x3 | x4 | x5 | x6 | x7 |
| d(xi,M1(1)) | 1.0541 | 1.0541 | 0.6667 | 3.3333 | 4.0138 | 4.1767 | 4.7376 |
| d(xi,M2(1)) | 4.9497 | 3.8079 | 3.5355 | 0.7071 | 0.7071 | 0.7071 | 0.7071 |
| Assignment | cl 1 | cl 1 | cl 1 | cl 2 | cl 2 | cl 2 | cl 2 |
which leads to the following clusters and cluster centers

Looking at Figure 6-4 it is seen that this particular clustering lines up with our visual clustering of the pattern vectors into two clusters.
(c) Using the first three pattern vectors x1, x2, and x3 as the initial cluster centers and following the steps of the K-Means algorithm the following iterations can be found
Iteration #1


Iteration #3


From Fig. 6.4 cl3(final) appears visually as a logical cluster. Since there appears to be only two main clusters the K-Means algorithm for 3 clusters is forced to break up the other three pattern vectors into two clusters. Thus it is seen that the choice of the number K of clusters is very important. There is no correct way to determine K as it is a function of the data creation which we do not have access to.
For many problems the K-Means algorithm may be used for several different values of K and final choice depends upon subjective measures. The following algorithm although a bit more complex in tits procedure has the ability to find a reasonable number for the number of classes as well as partition the pattern vectors into that number of clusters.
6.4 ISODATA CLUSTERING ALGORITHM
If the number of clusters is not known, clustering can be obtained using the K-means algorithm for increasing K until little reduction in performance J is obtained. The Iterative Self Organizing Data Analysis Technique A called the ISODATA algorithm provides another way to handle the problem of clustering when the number of classes is not known. It provides a set of rules for splitting and combining existing clusters to be used to obtain a final clustering. A rough flow diagram for the ISODATA Algorithm is shown in Figure 6-4.
The following definitions are needed before the algorithm can be described.
T= Threshold on the number of samples in clustering.
Nc= Approximate number of clusters.
S2=Maximum spread for parameter splitting
Dm=Maximum distance separation for merging.
NMAX=Maximum number of clusters that can be merged at each step.
The details of the algorithm are now presented.
ISODATA algorithm
Step 0 Initialization
Choose Nc cluster centers by any of the methods described in the K-means algorithm initialization procedure. Go to Step 2.
Step 1 Clustering
Partition the pattern vectors into clusters using the current cluster centers by assigning Xj to Pk using the minimum distance assignment procedure described in the K-Means algorithm. If any cluster has less than T members, decrease Nc and recluster. Continue this process until all clusters have T members.
Step 2 Splitting of Clusters
If Nc<2ND and iteration is odd then split any cluster whose samples form sufficiently disjoint groups according to the following rule
Criterion for splitting
Compute the average distance dk for samples in each cluster to their cluster and a measure of spread k2 defined as follows


(1) dk>d and Nk>2(T+1)
(2) Nc<ND/2
Process of splitting
The original cluster center is replaced by two
new centers displaced slightly along the axis of largest
variance where the amount of displacement is a small
fraction of k as shown in Figure 6-6.
Criterion for Merging
Compute the pairwise distances dij between cluster centers µi and µj as follows
If any pairs of distances correspond to a distance less than the threshold Dm,then the pairs are merged.
Process of Merging
Order all dij<Dm with the smallest first. The two clusters Pi and Pj represented by smallest dij are combined eliminating Pi and Pj from further merging. The next smallest distance between two clusters , neither of which is cluster Pi or Pj will also cause a merger of the two respective clusters. This process is continued until the number of merges equal NMAX or there are no more to merge whichever occurs first.
6.5 HIERARCHICAL CLUSTERING
Hierarchical clustering is a sequential approach which along the way provides clustering at many different number of clusters. The method uses an agglomerative (clumping) or divisive (splitting) approach and requires distance functions for calculating distance between clusters. At each level in the procedure for clumping a single pattern or group of patterns is joined with another sample or group of samples such that when combined the elements remain together through the remainder of the procedure.
6.5.1 Agglomerative Hierarchical Clustering
Let x1, x2, ... ,xN be N patterns that are to be clustered. The agglomerative method of Hierarchical clustering begins at level N with N clusters S1, S2, ... , SN each one containing a single pattern vector as shown in Figure 6-4. Level N-1 is obtained by computing the distance between each pair of clusters of level N, ie the pattern vectors themselves. The two vectors with the smallest distance, the closest together with respect to the distance function, are combined to produce a new cluster. This new cluster together with the clusters that were not combined are the clusters at level N-1 which are labeled S1(2),S2(2),...,SN-1(2) . Each level down defines a clustering of the pattern vectors into one fewer number of clusters by combining two clusters of the level above. The two clusters combined at each level are the closest to each other where closeness is determined by the given distance function between clusters. This requires the calculation of the distance between each pair of clusters using the given distance or similarity function. This process is continued until there is just one cluster S1(1) containing all the pattern vectors to be clustered. The overall history of the clustering procedure can be illustrated by what is called a dendrogram, an example of which appears in Figure 6-7.
An example showing the hierarchical agglomerative clustering is now
presented using the standard Euclidean distance squared as the distance between
pattern vectors and the distance between clusters as average distance between
clusters.
Example 6-2 [3]
(a) Use nearest neighbor Euclidean distance measure and perform hierarchical clustering on the following set of pattern vectors These pattern vectors are shown in Figure 6-8.

(b) give the resulting clusters at level #5 (5
clusters).
Solution. At Level 10 each pattern vector is its own cluster as follows
Taking each pair of these classes we compute the distance squared between them as

The two with the smallest distance, or distance squared as minimum is preserved over monotonic functions of positive numbers, are x2 and x6. Level 9 is obtained by combining these two to form a new cluster and bringing down all other clusters which are the pattern vectors themselves. This combination is illustrated at the top of the dendrogram for this problem that appears in Figure 6-8.
(b) At level 5 there are five clusters and as seen in Figure 6-8 they are given by

6.5.2 Divisive Hierarchical Clustering
Whereas agglomerative clustering begins with each element a cluster and then combines clusters using a distance measure, divisive hierarchical clustering begins with one cluster and then continually breaks these clusters into smaller and smaller clusters until the final step where each pattern vector is a cluster. The clusters at each level are examined and the one containing patterns that are the fartherest apart according to some distance measure are broken apart.
6.6 FUZZY CLUSTERING
In the Hierarchical and K-means algorithms patterns are always assigned to one cluster or another and as can be seen in the Hierarchical method this assignment is permanent whereas in the K-means algorithm it is not permanent but is definitely assigned to one cluster at each step of the algorithm. The purpose of this section is to describe the Fuzzy C-Means Algorithm which is based on fuzzy sets. It begins with a brief introduction to fuzzy sets and then describes the algorithm.
6.6.1 Fuzzy Sets
Given a set of vectors S={ x1, x2, ..., xN } which for us will be pattern vectors. Let A be an ordinary subset of S, i.e. a collection of elements from S. We can also specify a subset A by using the characteristic function µA() which is defined on S such that µA(xi)=1 if xiA and µA(xi)=0 if xi not a member of A.
For example if A={x1,x3,x5} then µA(xi) can be expressed as a function on the total space S giving the following vector of values [1,0,1,0,1,0,..., 0] as x goes from x1, x2, ... ,xN.
Thus a crisp clustering or partition of the set S into clusters can be defined by a set of characteristic functions of the above type. Subsets defined in this manner are sometimes called crisp subsets, as their entries are all zeros and ones and an element is a member of only one of the subsets.
A fuzzy set can be defined by extending
the concept of the characteristic function to
allow positive values between and including 0
and 1 for entries in the characteristic function.
A fuzzy subset of S is defined by a membership
function and this allows us to specify a degree
of membership rather than the all or nothing
specification in ordinary subsets. This mapping
is shown ion Figure 6-10.
The fuzzy set A is defined to be equivalent to specifying a membership function. For example:

A Fuzzy partitioning of S, composed of Fuzzy clusters F1, F2, ... , Fk, can then be described by membership functions of fuzzy sets defined as follows.


Example 6-6.
Assume that the pattern vectors to be
clustered are shown in Figure 6.9 One
possible crisp clustering could be expressed
for two clusters as follows

An example of a fuzzy clustering with F1 and F2 as the fuzzy clusters would be

These membership functions are shown in
Figure 6-10. For this problem that the
number of pattern vectors is six (Ns=6), the
number of clusters is 2 (k=2), and the
dimension of the pattern vectors is 2 (n=2).
Notice that summing down the columns of the membership functions gives


Compute the fuzzy centroids Vi, for i = 1, 2, ... , C using

Using the Vi, i = 1, 2, ... , C determined in step 2, compute the new fuzzy membership functions µi(xk) as follows

Step 4: Check for Convergence
The algorithm is said to converge when the membership functions at the (p+1)th step are the same as those at the pth iteration or alternatively when the performance J tends to be stabilized.
If the algorithm has converged then we stop and the current µi(k)'s specify the fuzzy clusters.
If convergence has not occurred, then return to step 2 if the number of iterations is less some predetermined maximum number of iterations, otherwise stop and announce no convergence was obtained.

Usually convergence occurs rather rapidly and different initial conditions can lead to different results. There is guaranteed convergence in a finite number iterations; however, the procedure could converge to a local minimum.
Example 6-4
Use the Fuzzy C-Means algorithm to find a clustering of the following set of pattern vectors


(a) Use the Fuzzy C-Means clustering algorithm to find two Fuzzy clusters. Try several different initial conditions. Comment on your results.
(b) Provide a crisp clustering of using the fuzzy clusters determined in part (a).
(c) Use the Fuzzy C-Means clustering algorithm to find three Fuzzy clusters. Do the results seem reasonable. Explain.
Solution:
(a) A random number generator was used to determine the following initial membership functions for the two clusters F1(1) and F2(1) and cluster centers.

Using the new cluster centers the following membership functions, new cluster centers , and performance can be calculated as

Continuing this process for ten more iterations the following membership functions for fuzzy clusters F1(15) and F2(15), cluster centers V1(15) and V2(15), and performance measure J(15) were obtained.

Trying different initial conditions and going through 17 iterations the following membership functions for fuzzy clusters F1(17) and F2(17), cluster centers V1(17) and V2(17), and performance measure J(17) were obtained.

Notice that for this particular problem the membership functions resulting from different initial conditions were the same. For most problems different membership functions will result.
(b) A crisp clustering into two clusters could be obtained by comparing the entries of the membership function. A 1 is placed in the crisp cluster if it has the highest membership value. Note in this case that there is a tie at .5 so the corresponding sample could be assigned to either cluster Cl1 or Cl2 , if assigned to Cl1 then we obtain the following crisp clusters


Problems
6.1 Use the k-means algorithm and the Euclidean distance to cluster the following set of pattern vectors

(b) Repeat (a) for five choices of initial clusters with random components between 0 and 4. Compare the results with those of (a) and comment on your results.
(c) With x1, x6, and x3 as initial cluster centers, find and identify 3 clusters.
6.2 The students in a digital signal processing course had the following semester averages. Names have been withheld to protect the innocent.


6.3 Given the following patterns:
(b) Give the resulting clusters for the two cluster case.
6.4 Using the data from problem 6.1.
(a) Perform an agglomerative Hierarchical clustering of the data from and illustrate the dendrogram for the process.
(b) Compare your results for the 2 and 3 clusters of Problem 6.1 and comment on your results.
(c) Does Hierarchical clustering always give a unique result. Explain.
6.5 A fuzzy clustering routine results in the following membership functions for three clusters as

(b) Use the minimum operator to find Cl1 intersect Cl2.
(c) If a fuzzy complement of a fuzzy set is defined by a membership function of 1-µ(x), find the fuzzy complement of Cl1 and show, using the maximum operator for fuzzy set union, that Cl1 U (complement of CL1) does not give the universe.
6.6 Use the Fuzzy C-means algorithm to cluster the data of Problem 6.1 for (a) number of clusters C=2 and C=3.
(b) Compare your results with that of Problem 6.1 for both number of clusters and comment on the comparison.
References
1. Pattern Recognition Principles, 2nd Edition, J.T. Tou and R.C. Gonzalez,
Addison-wesley, 1988. First edition 1974.
2. Classification, Estimation, and Pattern Recognition, Tzay Y. Young and
Thomas W. Calvert, Elsevier, 1974.
3. Pattern Classification and Scene Analysis, Richard O. Duda and Peter E.
Hart, Wiley, 1974.
4. An introduction to Mathematical Techniques in Pattern Recognition, Harry
C. Andrews, Wiley,1972.
5. Adaptive Pattern Recognition and Neural Networks, Yo-Han Pao, Addison-
Wesley, 1989.
6. Introduction to Statistical Pattern Recognition, Keinosuke Fukunaga,
Academic Press, 1972.
7. Mendel, Jerry A. Mendel, "Fuzzy Logic Systems for Engineering: A Tutorial", Proceedings of the IEEE, vol 83 No.3, March 1995.
8. Xie, Xuanli Lisa and Gerardo Beni, "A Validity Measure for Fuzzy
Clustering",PAMI-13, No. 8, pp 841-847, August 1991.
9. Fuzzy Sets, Neural Networks, and Soft Computing, Ronald K. Yager, Van Nostrad, 1994.
10. Neural Networks and Fuzzy Systems, Bart Kosco, Prentice Hall, 1992.
11. Decision, Estimation, and Classification, Charles W. Therrien, John Wiley
& Sons, 1989.
12. Fundamentals of Pattern Recognition, Edward A. Patrick, Prentice Hall,
1972.
13. Statistical Pattern Recognition, C. H. Chen, Hayden, 1973.
14. Pattern recognition and Machine Learning, K.S. Fu, Plenum Press, 1974.
15. Sequential Methods in Pattern Recognition and Machine Learning, K.S. FU,
Academic Press, 1968.
16. Fundamentals of Neural Networks, Laurene Fausett, Prentice Hall, 1994.