Chapter 6



CLUSTERING TECHNIQUES









6.0 INTRODUCTION



Clustering is the art of grouping together pattern vectors that in some sense belong together because of similar characteristics. In the most general problem the number of clusters or groups are unknown as are the properties that make them similar. In this chapter various techniques will be explored that perform clustering of a given set of pattern vectors. In particular the K-means algorithm, Hierarchical clustering, ISODATA algorithm, fuzzy K means algorithm, and Adaptive Resonance Theory algorithm are explored. This list is not exhaustive but considered as representative of methods for clustering numerical pattern vectors.



6.1 MATHEMATICAL CONCEPT OF CLUSTERING



Suppose that a set of n-dimensional pattern vectors { xj , j=1,2, ... ,NS } is given, as shown in Figure 6-1. Clustering is the process of partitioning those pattern vectors into a number of groups or clusters , call them Clk, k= 1,2, ... , K such that none of them are empty, they are pair by pair disjoint (their intersections are the null set), and they are mutually exhaustive(their union is the entire set of all samples considered). The clusters are formed to reflect some unspecified sense of similarity. The elements in cluster Clk could represent vectors from a particular pattern Class Ck if desired.



6.2 DISTANCES BETWEEN CLUSTERS

In section 3.1 the distance between pattern vectors was presented. It was shown that to be a legitimate distance function, in the sense of the definition given, the function must satisfy the positivity, commutativity and triangular inequality conditions.

The most common and familiar distance function between vectors x and y is the Euclidean distance function defined on a given vector space as:





A few other frequently used distance functions between vectors are the maximum component distance and the sum of absolute component distances defined in section 3.7.

Many clustering techniques use the distance between clusters in their procedure, in particular the K-means, the Hierarchical clustering procedure and ISODATA clustering algorithms. As clusters are sets of pattern vectors an extension of the definition of the distance between vectors is required. There are many useful defined distance functions between clusters and the choice is usually based on known properties of the problem at hand and processing simplicity and limitations.

Since clusters are collections of vectors they can be described as subsets. Let Si and Sj be clusters, i.e. subsets of a given S such that Si and Sj are not , the null set, and Si and Sj are disjoint (Si intersect Sj=). Some common distance functions between subsets or clusters, based on subjective measures are the nearest neighbor distance, average distance, distance between means, distance between medians, and farthest neighbor distance. Each of these measures is defined as follows for Si and Sj being subsets of the total pattern space S.





(1) Nearest Neighbor

(2) Average Distance

where Ni and Nj are the number of vectors in cluster Si and Sj respectively.

(3) Distance Between Means

(4) Distance Between Medians

If medi represents the vector of components that are the median of vector components for cluster Si.

(5) Farthest Neighbor

The norm in the above forms can be replaced by any distance function. Figure 6-2 conceptually illustrates a few of the given measures.



6.3 K-MEANS CLUSTERING ALGORITHM



Define a performance measure J as follows

The objective of the K-means algorithm is to select Pk and Mk such that J is minimized. Note that if we were to chose K equal to NS and Mk = Xk for k=1 to NS that the performance index J would go to zero. This would in a sense say that each pattern vector is its own cluster which would not be a realistic solution although mathematically correct from a minimum J standpoint. The K-means algorithm uses a fixed number, K, of clusters that is usually selected much less than the number of pattern vectors. With such a fixed K the minimization of J makes sense and any optimization algorithm could be used to find the best cluster assignment with respect to J. The K-means algorithm flow diagram is shown in Figure 6.2.



The details of the K-Means algorithm follow and represent alternating minimization of J with respect to cluster centers and cluster assignments. First J is minimized with the cluster centers fixed by varying the cluster assignment and then J is minimized with cluster assignments fixed and cluster centers varied.



K-Means Clustering Algorithm



Step 1 Initialization

Choose K initial cluster centers, call them M1(1), M2(1), ... , MK(1). These K, n by 1 vectors could be chosen at random from all the pattern vectors given, or be the first K samples, or be samples based on some apriori information regarding physical clusters. Set the iteration index m equal to 1 and go to step 2.



Step 2 Determine New Clusters

Using the cluster centers Mk(m) at iteration m, distribute the pattern vectors by using the nearest neighbor rule. If the euclidean distance is used then pattern vector xj is assigned to cluster Clk(m) if

Ties can be resolved randomly or by selecting the lowest index. Go to step 3.



Step 3 Compute New Cluster Centers

Using the new clusters Clk(m), compute new cluster centers. The new cluster centers, Mk(m+1), k=1,2,...,K, can be determined as the averages of the patterns in each cluster as follows

where Nk, k=1,2,...,K are the numbers of pattern vectors in respective clusters Clk(m), k=1,2,...,K. Go to step 4.



Step 4 Check for Convergence

Using the cluster centers computed in the previous step, check to see if clustering is finished. Convergence will have occurred if none of the cluster centers are changed, that is

If convergence occurs clustering is complete and the K clusters and cluster centers obtained are given by Clk(m+1) and Mk(m) respectively for k=1,2,..., K.

If Eq. (6-4) is not satisfied go to step 5.



Step 5 Check for Maximum Number of Iterations

As a safety valve, a maximum number of iterations called MAXIT can be assigned for stopping without convergence. If m>MAXIT then stop and announce that convergence was not obtained. If m<MAXIT then increment m by 1 and return to step 2.



The final cluster assignment is not unique as different initial conditions can give different clusters as shown in the following example.



EXAMPLE 6.1

Use the K-means algorithm with Euclidean distance to find clusters for the following set of pattern vectors shown in Figure 6- and given as



(a) using the first two samples, x1 and x2, as initial cluster centers for the K-Means algorithm find a separation of the pattern vectors into two clusters.

(b) using the first three samples, x1, x2, and x3 as initial cluster centers for the K-Means algorithm find a separation of the pattern vectors into three clusters.



Solution:

(a) The K-means algorithm step 1 is an initialization of the cluster centers and using the first two samples we have M1(1) and M2(1) as

Step 2 computes new cluster assignment assigning the pattern to the cluster center that is closest. Calculating the distance from each sample to the cluster centers and assigning the pattern vector to the cluster centers gives the following assignment



--- x1 x2 x3 x4 x5 x6 x7
d(xi,M1(1)) 0 2 1.4142 4.2426 5 5 5.6569
d(xi,M2(1)) 2 0 1.4142 3.1623 3.6055 4.1231 4.4721
Assignment cl 1 cl 2 cl 1 cl 2 cl 2 cl 2 cl 2




Thus the new clusters Cl1(1) and CL2(1) become

Step 3 computes new cluster centers using Eq 6- to get.

Since this was the first assignment no check for convergence can be done and MAXIT has not been reached so set m=2 (after incrementing) and return to step 3.

The second iteration gives the new cluster assignment assigning the pattern to the cluster center that is closest. Calculating the distance from each sample to the cluster centers and assigning the pattern vector to the cluster centers gives the following assignment

--- x1 x2 x3 x4 x5 x6 x7
d(xi,M1(1)) 0.7071 1.5811 0.7071 3.5355 4.3012 4.3012 4.9497
d(xi,M2(1)) 4.2521 3.0463 2.8425 0.2828 0.8246 1.2166 1.4422
Assignment cl 1 cl 1 cl 1 cl 2 cl 2 cl 2 cl 2


Thus the new clusters Cl1(2) and CL2(2) become

Step 3 computes new cluster centers using Eq 6- to get.

Since M1(2) and M2(2) do not equal M1(1) and M2(1) a new iteration is performed beginning with a new redistribution as follows

--- x1 x2 x3 x4 x5 x6 x7
d(xi,M1(1)) 1.0541 1.0541 0.6667 3.3333 4.0138 4.1767 4.7376
d(xi,M2(1)) 4.9497 3.8079 3.5355 0.7071 0.7071 0.7071 0.7071
Assignment cl 1 cl 1 cl 1 cl 2 cl 2 cl 2 cl 2

which leads to the following clusters and cluster centers

Since convergence has occurred the procedure is stopped and the final clustering is



Looking at Figure 6-4 it is seen that this particular clustering lines up with our visual clustering of the pattern vectors into two clusters.

(c) Using the first three pattern vectors x1, x2, and x3 as the initial cluster centers and following the steps of the K-Means algorithm the following iterations can be found

Iteration #1

Iteration #2





Iteration #3

Iteration #4 gives the same clustering as Iteration #3 and the procedure has converged and stopped with final clustering as



From Fig. 6.4 cl3(final) appears visually as a logical cluster. Since there appears to be only two main clusters the K-Means algorithm for 3 clusters is forced to break up the other three pattern vectors into two clusters. Thus it is seen that the choice of the number K of clusters is very important. There is no correct way to determine K as it is a function of the data creation which we do not have access to.

For many problems the K-Means algorithm may be used for several different values of K and final choice depends upon subjective measures. The following algorithm although a bit more complex in tits procedure has the ability to find a reasonable number for the number of classes as well as partition the pattern vectors into that number of clusters.





















6.4 ISODATA CLUSTERING ALGORITHM



If the number of clusters is not known, clustering can be obtained using the K-means algorithm for increasing K until little reduction in performance J is obtained. The Iterative Self Organizing Data Analysis Technique A called the ISODATA algorithm provides another way to handle the problem of clustering when the number of classes is not known. It provides a set of rules for splitting and combining existing clusters to be used to obtain a final clustering. A rough flow diagram for the ISODATA Algorithm is shown in Figure 6-4.



The following definitions are needed before the algorithm can be described.



T= Threshold on the number of samples in clustering.

Nc= Approximate number of clusters.

S2=Maximum spread for parameter splitting

Dm=Maximum distance separation for merging.

NMAX=Maximum number of clusters that can be merged at each step.



The details of the algorithm are now presented.

ISODATA algorithm



Step 0 Initialization

Choose Nc cluster centers by any of the methods described in the K-means algorithm initialization procedure. Go to Step 2.



Step 1 Clustering

Partition the pattern vectors into clusters using the current cluster centers by assigning Xj to Pk using the minimum distance assignment procedure described in the K-Means algorithm. If any cluster has less than T members, decrease Nc and recluster. Continue this process until all clusters have T members.



Step 2 Splitting of Clusters

If Nc<2ND and iteration is odd then split any cluster whose samples form sufficiently disjoint groups according to the following rule

Criterion for splitting

Compute the average distance dk for samples in each cluster to their cluster and a measure of spread k2 defined as follows

Let davg be the weighted average distance given by

If for any k k2>s2 then the cluster is split provided either

(1) dk>d and Nk>2(T+1)

(2) Nc<ND/2

Process of splitting

The original cluster center is replaced by two new centers displaced slightly along the axis of largest variance where the amount of displacement is a small fraction of k as shown in Figure 6-6.



Criterion for Merging

Compute the pairwise distances dij between cluster centers µi and µj as follows

If any pairs of distances correspond to a distance less than the threshold Dm,then the pairs are merged.



Process of Merging

Order all dij<Dm with the smallest first. The two clusters Pi and Pj represented by smallest dij are combined eliminating Pi and Pj from further merging. The next smallest distance between two clusters , neither of which is cluster Pi or Pj will also cause a merger of the two respective clusters. This process is continued until the number of merges equal NMAX or there are no more to merge whichever occurs first.

6.5 HIERARCHICAL CLUSTERING



Hierarchical clustering is a sequential approach which along the way provides clustering at many different number of clusters. The method uses an agglomerative (clumping) or divisive (splitting) approach and requires distance functions for calculating distance between clusters. At each level in the procedure for clumping a single pattern or group of patterns is joined with another sample or group of samples such that when combined the elements remain together through the remainder of the procedure.



6.5.1 Agglomerative Hierarchical Clustering

Let x1, x2, ... ,xN be N patterns that are to be clustered. The agglomerative method of Hierarchical clustering begins at level N with N clusters S1, S2, ... , SN each one containing a single pattern vector as shown in Figure 6-4. Level N-1 is obtained by computing the distance between each pair of clusters of level N, ie the pattern vectors themselves. The two vectors with the smallest distance, the closest together with respect to the distance function, are combined to produce a new cluster. This new cluster together with the clusters that were not combined are the clusters at level N-1 which are labeled S1(2),S2(2),...,SN-1(2) . Each level down defines a clustering of the pattern vectors into one fewer number of clusters by combining two clusters of the level above. The two clusters combined at each level are the closest to each other where closeness is determined by the given distance function between clusters. This requires the calculation of the distance between each pair of clusters using the given distance or similarity function. This process is continued until there is just one cluster S1(1) containing all the pattern vectors to be clustered. The overall history of the clustering procedure can be illustrated by what is called a dendrogram, an example of which appears in Figure 6-7.

An example showing the hierarchical agglomerative clustering is now presented using the standard Euclidean distance squared as the distance between pattern vectors and the distance between clusters as average distance between clusters.



Example 6-2 [3]

(a) Use nearest neighbor Euclidean distance measure and perform hierarchical clustering on the following set of pattern vectors These pattern vectors are shown in Figure 6-8.

(b) give the resulting clusters at level #5 (5 clusters).



Solution. At Level 10 each pattern vector is its own cluster as follows

Taking each pair of these classes we compute the distance squared between them as



The two with the smallest distance, or distance squared as minimum is preserved over monotonic functions of positive numbers, are x2 and x6. Level 9 is obtained by combining these two to form a new cluster and bringing down all other clusters which are the pattern vectors themselves. This combination is illustrated at the top of the dendrogram for this problem that appears in Figure 6-8.



(b) At level 5 there are five clusters and as seen in Figure 6-8 they are given by

Looking at the position of the pattern vectors in Figure 6-6 it is seen that S5(5) forms an obvious cluster visually. Note also that x1, x4, and x3 have yet to be combined. When the patterns are of small dimension the eye can usually do a good job of clustering using the standard Euclidean distance measures, however as the dimension of the pattern vectors increases past three it becomes impossible to perform the clustering visually.



6.5.2 Divisive Hierarchical Clustering

Whereas agglomerative clustering begins with each element a cluster and then combines clusters using a distance measure, divisive hierarchical clustering begins with one cluster and then continually breaks these clusters into smaller and smaller clusters until the final step where each pattern vector is a cluster. The clusters at each level are examined and the one containing patterns that are the fartherest apart according to some distance measure are broken apart.



6.6 FUZZY CLUSTERING

In the Hierarchical and K-means algorithms patterns are always assigned to one cluster or another and as can be seen in the Hierarchical method this assignment is permanent whereas in the K-means algorithm it is not permanent but is definitely assigned to one cluster at each step of the algorithm. The purpose of this section is to describe the Fuzzy C-Means Algorithm which is based on fuzzy sets. It begins with a brief introduction to fuzzy sets and then describes the algorithm.



6.6.1 Fuzzy Sets

Given a set of vectors S={ x1, x2, ..., xN } which for us will be pattern vectors. Let A be an ordinary subset of S, i.e. a collection of elements from S. We can also specify a subset A by using the characteristic function µA() which is defined on S such that µA(xi)=1 if xiA and µA(xi)=0 if xi not a member of A.

For example if A={x1,x3,x5} then µA(xi) can be expressed as a function on the total space S giving the following vector of values [1,0,1,0,1,0,..., 0] as x goes from x1, x2, ... ,xN.

Thus a crisp clustering or partition of the set S into clusters can be defined by a set of characteristic functions of the above type. Subsets defined in this manner are sometimes called crisp subsets, as their entries are all zeros and ones and an element is a member of only one of the subsets.

A fuzzy set can be defined by extending the concept of the characteristic function to allow positive values between and including 0 and 1 for entries in the characteristic function. A fuzzy subset of S is defined by a membership function and this allows us to specify a degree of membership rather than the all or nothing specification in ordinary subsets. This mapping is shown ion Figure 6-10.



The fuzzy set A is defined to be equivalent to specifying a membership function. For example:

Specifying µA is equivalent to specifying the fuzzy set A. A value close to 1 indicates a high degree of membership, whereas a value close to 0 represents a low degree of membership. The degree of membership above for x1 is .9. A value of .4 would represent a lower degree of membership than a value of .5.

A Fuzzy partitioning of S, composed of Fuzzy clusters F1, F2, ... , Fk, can then be described by membership functions of fuzzy sets defined as follows.

such that

The first inequality bounds the membership values between 0 and 1, the second states that the sum down each column must add to one, and the third excludes all zeros, the null set, and all ones, the total space S. For mathematical simplicity µk will sometimes be used for representing the fuzzy cluster Fk rather than the right side of Eq. 6-10. An example is now presented that illustrates both a fuzzy clustering and a crisp clustering.



Example 6-6.

Assume that the pattern vectors to be clustered are shown in Figure 6.9 One possible crisp clustering could be expressed for two clusters as follows

If membership function form for these crisp clusters is desired, it would be as follows





An example of a fuzzy clustering with F1 and F2 as the fuzzy clusters would be

These membership functions are shown in Figure 6-10. For this problem that the number of pattern vectors is six (Ns=6), the number of clusters is 2 (k=2), and the dimension of the pattern vectors is 2 (n=2).

Notice that summing down the columns of the membership functions gives

where

Step 2: Computation of Fuzzy Centroids Vi

Compute the fuzzy centroids Vi, for i = 1, 2, ... , C using

Step 3: Compute New Fuzzy Membership Functions

Using the Vi, i = 1, 2, ... , C determined in step 2, compute the new fuzzy membership functions µi(xk) as follows





Step 4: Check for Convergence

The algorithm is said to converge when the membership functions at the (p+1)th step are the same as those at the pth iteration or alternatively when the performance J tends to be stabilized.

If the algorithm has converged then we stop and the current µi(k)'s specify the fuzzy clusters.

If convergence has not occurred, then return to step 2 if the number of iterations is less some predetermined maximum number of iterations, otherwise stop and announce no convergence was obtained.







Usually convergence occurs rather rapidly and different initial conditions can lead to different results. There is guaranteed convergence in a finite number iterations; however, the procedure could converge to a local minimum.



Example 6-4

Use the Fuzzy C-Means algorithm to find a clustering of the following set of pattern vectors















(a) Use the Fuzzy C-Means clustering algorithm to find two Fuzzy clusters. Try several different initial conditions. Comment on your results.

(b) Provide a crisp clustering of using the fuzzy clusters determined in part (a).

(c) Use the Fuzzy C-Means clustering algorithm to find three Fuzzy clusters. Do the results seem reasonable. Explain.



Solution:

(a) A random number generator was used to determine the following initial membership functions for the two clusters F1(1) and F2(1) and cluster centers.

Using the new cluster centers the following membership functions, new cluster centers , and performance can be calculated as

Continuing this process for ten more iterations the following membership functions for fuzzy clusters F1(15) and F2(15), cluster centers V1(15) and V2(15), and performance measure J(15) were obtained.



Trying different initial conditions and going through 17 iterations the following membership functions for fuzzy clusters F1(17) and F2(17), cluster centers V1(17) and V2(17), and performance measure J(17) were obtained.

Notice that for this particular problem the membership functions resulting from different initial conditions were the same. For most problems different membership functions will result.

(b) A crisp clustering into two clusters could be obtained by comparing the entries of the membership function. A 1 is placed in the crisp cluster if it has the highest membership value. Note in this case that there is a tie at .5 so the corresponding sample could be assigned to either cluster Cl1 or Cl2 , if assigned to Cl1 then we obtain the following crisp clusters

(c) Using the Fuzzy C-Means Algorithm the following final membership functions are obtained

The results indicated that samples x1 and x2 have high membership values for F1, x3 has a high membership value for F2, and x4 and x5 have high membership values for F3. These particular assignments seem reasonable by letting our eyes cluster the samples into three classes. In most problems clustering visually is impossible because the dimension of the pattern vectors is usually much bigger than two.







Problems



6.1 Use the k-means algorithm and the Euclidean distance to cluster the following set of pattern vectors

(a) With x1 and x6 as initial cluster centers find and identify 2 clusters.

(b) Repeat (a) for five choices of initial clusters with random components between 0 and 4. Compare the results with those of (a) and comment on your results.

(c) With x1, x6, and x3 as initial cluster centers, find and identify 3 clusters.



6.2 The students in a digital signal processing course had the following semester averages. Names have been withheld to protect the innocent.

Use the K-means algorithms to assign the final grades using the following initial conditions.

6.3 Given the following patterns:

(a) Construct the dendrogram for the hierarchical clustering algorithm, showing all distances during each step. (use average distance measure)

(b) Give the resulting clusters for the two cluster case.



6.4 Using the data from problem 6.1.

(a) Perform an agglomerative Hierarchical clustering of the data from and illustrate the dendrogram for the process.

(b) Compare your results for the 2 and 3 clusters of Problem 6.1 and comment on your results.

(c) Does Hierarchical clustering always give a unique result. Explain.



6.5 A fuzzy clustering routine results in the following membership functions for three clusters as

(a) Describe a procedure for obtaining a "crisp" clustering of the pattern vectors from the fuzzy clusters given and find crisp clusters associated with the given fuzzy clusters (membership functions should have only 1's and 0's for crisp clusters).

(b) Use the minimum operator to find Cl1 intersect Cl2.

(c) If a fuzzy complement of a fuzzy set is defined by a membership function of 1-µ(x), find the fuzzy complement of Cl1 and show, using the maximum operator for fuzzy set union, that Cl1 U (complement of CL1) does not give the universe.

6.6 Use the Fuzzy C-means algorithm to cluster the data of Problem 6.1 for (a) number of clusters C=2 and C=3.

(b) Compare your results with that of Problem 6.1 for both number of clusters and comment on the comparison.



References



1. Pattern Recognition Principles, 2nd Edition, J.T. Tou and R.C. Gonzalez,

Addison-wesley, 1988. First edition 1974.



2. Classification, Estimation, and Pattern Recognition, Tzay Y. Young and

Thomas W. Calvert, Elsevier, 1974.



3. Pattern Classification and Scene Analysis, Richard O. Duda and Peter E.

Hart, Wiley, 1974.



4. An introduction to Mathematical Techniques in Pattern Recognition, Harry

C. Andrews, Wiley,1972.



5. Adaptive Pattern Recognition and Neural Networks, Yo-Han Pao, Addison-

Wesley, 1989.



6. Introduction to Statistical Pattern Recognition, Keinosuke Fukunaga,

Academic Press, 1972.



7. Mendel, Jerry A. Mendel, "Fuzzy Logic Systems for Engineering: A Tutorial", Proceedings of the IEEE, vol 83 No.3, March 1995.



8. Xie, Xuanli Lisa and Gerardo Beni, "A Validity Measure for Fuzzy

Clustering",PAMI-13, No. 8, pp 841-847, August 1991.



9. Fuzzy Sets, Neural Networks, and Soft Computing, Ronald K. Yager, Van Nostrad, 1994.



10. Neural Networks and Fuzzy Systems, Bart Kosco, Prentice Hall, 1992.

11. Decision, Estimation, and Classification, Charles W. Therrien, John Wiley

& Sons, 1989.



12. Fundamentals of Pattern Recognition, Edward A. Patrick, Prentice Hall,

1972.



13. Statistical Pattern Recognition, C. H. Chen, Hayden, 1973.



14. Pattern recognition and Machine Learning, K.S. Fu, Plenum Press, 1974.



15. Sequential Methods in Pattern Recognition and Machine Learning, K.S. FU,

Academic Press, 1968.



16. Fundamentals of Neural Networks, Laurene Fausett, Prentice Hall, 1994.