1 PREPARING FOR THE ANALYSIS

1.1 Install and load the package ppclust

This vignette is designed to be used with the ppclust package. You can install the latest version of ‘ppclust’ from CRAN with the following command:

install.packages("ppclust")

If ‘ppclust’ is already installed, you can load it into the R working environment with the following command:

library(ppclust)

1.2 Load the required packages

To visualize the clustering results, some examples in this vignette use functions from other cluster analysis packages, namely ‘cluster’, ‘fclust’ and ‘factoextra’. These packages should therefore be loaded into the R working environment with the following commands:

library(factoextra)
library(cluster)
library(fclust)

1.3 Load the data set

We demonstrate FCM on the Iris data set (Anderson, 1935). It is a real data set containing four features (Sepal.Length, Sepal.Width, Petal.Length and Petal.Width) of 150 iris flowers, with the species in the last column as the class variable. This four-dimensional data set contains 50 samples of each of the three iris species. One of these three natural clusters (Class 1) is linearly separable from the other two, while Classes 2 and 3 overlap to some extent, as seen in the plot below.

data(iris)
x <- iris[, -5]  # keep the four numeric features, drop the species column
x
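As a quick sanity check of the description above, the following base R sketch confirms the dimensions and the class balance of the data set. It uses only the built-in iris data. The variable names n_objects, n_features and class_counts are illustrative and are not part of ‘ppclust’.

```r
# Inspect the structure of the built-in iris data (base R only).
data(iris)
x <- iris[, -5]                      # drop the class column (Species)

n_objects    <- nrow(x)              # number of flowers
n_features   <- ncol(x)              # number of numeric features
class_counts <- table(iris$Species)  # samples per species

print(c(objects = n_objects, features = n_features))
print(class_counts)
```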

Plot the data, colored by iris species:

pairs(x, col=iris[,5])

2 FUZZY C-MEANS CLUSTERING

Fuzzy C-Means (FCM) is a soft clustering algorithm proposed by Bezdek (1974; 1981). Unlike the K-means algorithm, in which each data object belongs to exactly one cluster, in FCM a data object is a member of all clusters with varying degrees of fuzzy membership between 0 and 1. Hence, data objects closer to a cluster center have higher degrees of membership in that cluster than objects scattered near its borders.

2.1 Run FCM with Single Start

2.1.1 Initialization

To start FCM, as with other alternating optimization algorithms, an initialization step is required to build the initial cluster prototypes matrix and the fuzzy membership degrees matrix. Although this task is usually performed in the initialization step of the clustering algorithm, the initial prototypes and memberships can also be supplied directly by the user.

FCM is usually started with an integer specifying the number of clusters. In this case, the prototypes matrix is generated internally by one of the prototype initialization algorithms included in the package ‘inaparc’. In the current version of the fcm function, the default initialization technique is K-means++. In the following code block, FCM runs for three clusters with the default values of the remaining arguments.

res.fcm <- fcm(x, centers=3)

Although FCM is usually run with a predetermined number of clusters, users may sometimes wish to start the algorithm from a specific cluster prototypes matrix. In this case, FCM uses the user-specified prototypes matrix instead of an internally generated one, as follows:

v0 <- matrix(c(5.0, 3.4, 1.4, 0.3,
               6.7, 3.0, 5.6, 2.1,
               5.8, 2.7, 4.3, 1.4),
             nrow=3, ncol=4, byrow=TRUE)
print(v0)

##      [,1] [,2] [,3] [,4]
## [1,]  5.0  3.4  1.4  0.3
## [2,]  6.7  3.0  5.6  2.1
## [3,]  5.8  2.7  4.3  1.4

res.fcm <- fcm(x, centers=v0)

The following code block shows another way to initialize the cluster prototypes for three clusters. The cluster prototypes matrix v0 is initialized with the K-means++ initialization algorithm (kmpp) in the R package ‘inaparc’.

v0 <- inaparc::kmpp(x, k=3)$v
print(v0)

##      Sepal.Length Sepal.Width Petal.Length Petal.Width
## Cl.1          5.0         3.5          1.3         0.3
## Cl.2          6.5         3.0          5.2         2.0
## Cl.3          5.0         2.3          3.3         1.0

res.fcm <- fcm(x, centers=v0)

The R package ‘inaparc’ provides many other techniques for initializing cluster prototype matrices. See the package documentation for the alternatives.

Alternatively, FCM can be started with a user-specified membership degrees matrix. In this case, the function fcm is called with the argument memberships as follows:

u0 <- inaparc::imembrand(nrow(x), k=3)$u
res.fcm <- fcm(x, centers=3, memberships=u0)

To generate the matrices of cluster prototypes and membership degrees, users can choose any of the initialization algorithms available in the R package ‘inaparc’. For this purpose, the arguments alginitv and alginitu can be specified for cluster prototypes and membership degrees, respectively.

res.fcm <- fcm(x, centers=3, alginitv="hartiganwong", alginitu="imembrand")

In general, FCM uses the squared Euclidean distance to compute the distances between the cluster centers and the data objects. Users can specify another distance metric with the argument dmetric as follows:

res.fcm <- fcm(x, centers=3, dmetric="correlation")
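To make the role of the distance metric concrete, the following base R sketch computes squared Euclidean distances between every data object and every prototype. It illustrates the textbook definition only and is not taken from the ‘ppclust’ source code. The prototype values repeat the user-specified v0 shown earlier, and the names v and d2 are illustrative.

```r
# Squared Euclidean distances between every data object and every prototype.
data(iris)
x <- as.matrix(iris[, -5])

# Illustrative prototypes (same values as the user-specified v0 above).
v <- matrix(c(5.0, 3.4, 1.4, 0.3,
              6.7, 3.0, 5.6, 2.1,
              5.8, 2.7, 4.3, 1.4),
            nrow = 3, ncol = 4, byrow = TRUE)

# One column per prototype: d2[i, j] = sum((x[i, ] - v[j, ])^2)
d2 <- sapply(seq_len(nrow(v)), function(j) {
  rowSums(sweep(x, 2, v[j, ], FUN = "-")^2)
})

dim(d2)                 # objects in rows, prototypes in columns
round(head(d2, 3), 2)   # distances of the first three flowers
```

Each row of d2 holds the distances from one flower to the three prototypes. Membership degrees are then derived from these distances.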

Although the fuzzy exponent m is set to 2 in most applications of FCM, users can increase m to get fuzzier results as follows:

res.fcm <- fcm(x, centers=3, m=4)
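The effect of m can be seen in a small example that needs no packages. The sketch below applies the standard FCM membership update for fixed prototypes to a toy one-dimensional data set. The function fcm_memberships and the toy values are hypothetical illustrations, not the internals of ‘ppclust’. The sketch also assumes strictly positive distances.

```r
# Standard FCM membership update for fixed prototypes (textbook form;
# not taken from the ppclust source code).
fcm_memberships <- function(d2, m = 2) {
  # d2: squared distances, objects in rows and clusters in columns,
  #     assumed strictly positive to avoid division by zero.
  w <- (1 / d2)^(1 / (m - 1))
  w / rowSums(w)
}

# Toy example: two prototypes at 0 and 10, three objects on a line.
objects    <- c(1, 5, 9)
prototypes <- c(0, 10)
d2 <- outer(objects, prototypes, function(a, b) (a - b)^2)

u_m2 <- fcm_memberships(d2, m = 2)
u_m4 <- fcm_memberships(d2, m = 4)

round(u_m2, 3)
round(u_m4, 3)
```

In this toy case the object at 1 belongs to the first cluster with a membership of about 0.99 for m = 2 and about 0.81 for m = 4. The object at 5, which is equidistant from both prototypes, receives 0.5 in both cases. A larger m pulls the memberships toward equal shares.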

The function fcm has other input arguments that control the run of the FCM algorithm. For details, see fcm in the package manual of ppclust.

2.1.2 Clustering Results

2.1.2.1 Fuzzy Membership Matrix

The fuzzy membership degrees matrix is the main output of the function fcm.

res.fcm <- fcm(x, centers=3)
as.data.frame(res.fcm$u)
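A membership matrix can be reduced to crisp labels by assigning each object to the cluster with its largest membership degree. The base R sketch below does this for five membership rows copied from the summary output in Section 2.1.2.3 (objects 146 to 150). That the crisp clustering vector of ‘ppclust’ follows this maximum-membership rule is an assumption here, although it is consistent with the printed output.

```r
# Bottom five membership rows as printed in the summary output
# (objects 146 to 150).
u <- matrix(c(0.8823507, 0.1063871, 0.01126223,
              0.4666788, 0.5075252, 0.02579593,
              0.8314467, 0.1564396, 0.01211367,
              0.7893823, 0.1890364, 0.02158126,
              0.3913000, 0.5817811, 0.02691888),
            ncol = 3, byrow = TRUE,
            dimnames = list(as.character(146:150),
                            paste("Cluster", 1:3)))

rowSums(u)                       # each row sums to 1 up to rounding
crisp <- apply(u, 1, which.max)  # index of the largest membership per row
crisp
```

Objects 147 and 150 have their largest membership in Cluster 2 at only about 0.51 and 0.58, so their crisp labels hide a considerable share of membership in Cluster 1.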

2.1.2.2 Initial and Final Cluster Prototypes

The initial and final cluster prototypes matrices can be accessed as follows:

res.fcm$v0

##           Sepal.Length Sepal.Width Petal.Length Petal.Width
## Cluster 1          7.6         3.0          6.6         2.1
## Cluster 2          6.3         3.3          4.7         1.6
## Cluster 3          4.8         3.4          1.6         0.2

res.fcm$v

##           Sepal.Length Sepal.Width Petal.Length Petal.Width
## Cluster 1     6.775011    3.052382     5.646782   2.0535467
## Cluster 2     5.888932    2.761069     4.363952   1.3973150
## Cluster 3     5.003966    3.414089     1.482816   0.2535463

2.1.2.3 Summary of Clustering Results

The function fcm returns the clustering results as an instance of the ‘ppclust’ class. This object is summarized with the summary method of the package ‘ppclust’ as follows:

summary(res.fcm)

## Summary for 'res.fcm'
## 
## Number of data objects:  150 
## 
## Number of clusters:  3 
## 
## Crisp clustering vector:
##   [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
##  [38] 3 3 3 3 3 3 3 3 3 3 3 3 3 1 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [75] 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 1 1 1 1 2 1 1 1 1
## [112] 1 1 2 1 1 1 1 1 2 1 2 1 2 1 1 2 2 1 1 1 1 1 2 1 1 1 1 2 1 1 1 2 1 1 1 2 1
## [149] 1 2
## 
## Initial cluster prototypes:
##           Sepal.Length Sepal.Width Petal.Length Petal.Width
## Cluster 1          7.6         3.0          6.6         2.1
## Cluster 2          6.3         3.3          4.7         1.6
## Cluster 3          4.8         3.4          1.6         0.2
## 
## Final cluster prototypes:
##           Sepal.Length Sepal.Width Petal.Length Petal.Width
## Cluster 1     6.775011    3.052382     5.646782   2.0535467
## Cluster 2     5.888932    2.761069     4.363952   1.3973150
## Cluster 3     5.003966    3.414089     1.482816   0.2535463
## 
## Distance between the final cluster prototypes
##           Cluster 1 Cluster 2
## Cluster 2  2.946292          
## Cluster 3 23.846049 10.818752
## 
## Difference between the initial and final cluster prototypes
##           Sepal.Length Sepal.Width Petal.Length Petal.Width
## Cluster 1   -0.8249888  0.05238227   -0.9532182 -0.04645334
## Cluster 2   -0.4110676 -0.53893064   -0.3360484 -0.20268496
## Cluster 3    0.2039660  0.01408886   -0.1171845  0.05354632
## 
## Root Mean Squared Deviations (RMSD): 0.8690926 
## Mean Absolute Deviation (MAD): 5.00608 
## 
## Membership degrees matrix (top and bottom 5 rows): 
##     Cluster 1   Cluster 2 Cluster 3
## 1 0.001072034 0.002304380 0.9966236
## 2 0.007497947 0.016649509 0.9758525
## 3 0.006414579 0.013759500 0.9798259
## 4 0.010107523 0.022465031 0.9674274
## 5 0.001767935 0.003761709 0.9944704
## ...
##     Cluster 1 Cluster 2  Cluster 3
## 146 0.8823507 0.1063871 0.01126223
## 147 0.4666788 0.5075252 0.02579593
## 148 0.8314467 0.1564396 0.01211367
## 149 0.7893823 0.1890364 0.02158126
## 150 0.3913000 0.5817811 0.02691888
## 
## Descriptive statistics for the membership degrees by clusters
##           Size       Min        Q1      Mean    Median        Q3       Max
## Cluster 1   40 0.5006317 0.7807561 0.8351480 0.8604619 0.9122633 0.9888134
## Cluster 2   60 0.5075252 0.6697398 0.7826035 0.7963157 0.9164202 0.9737972
## Cluster 3   50 0.8413450 0.9541261 0.9645018 0.9763228 0.9850474 0.9995473
## 
## Dunn's Fuzziness Coefficients:
## dunn_coeff normalized 
##  0.7833975  0.6750962 
## 
## Within cluster sum of squares by cluster:
##        1        2        3 
## 27.05750 36.81767 15.15100 
## (between_SS / total_SS =  88.04%) 
## 
## Available components: 
##  [1] "u"          "v"          "v0"         "d"          "x"         
##  [6] "cluster"    "csize"      "sumsqrs"    "k"          "m"         
## [11] "iter"       "best.start" "func.val"   "comp.time"  "inpargs"   
## [16] "algorithm"  "call"

All available components of the ‘ppclust’ object are listed at the end of the summary. These components can be accessed as elements of the object with the $ operator. For example, the execution time of the FCM run is accessed as follows:

res.fcm$comp.time

## [1] 0.506

2.2 Run FCM with Multiple Starts

To increase the chance of finding an optimal solution, the function fcm can be started multiple times. As seen in the following code chunk, the argument nstart is used for this purpose.

res.fcm <- fcm(x, centers=3, nstart=5)

With multiple starts, it may be desirable to keep either the initial cluster prototypes or the initial membership degrees unchanged between the starts of the algorithm. In this case, the arguments fixcent and fixmemb are used to fix the initial cluster prototypes matrix and the initial membership degrees matrix, respectively.

res.fcm <- fcm(x, centers=3, nstart=5, fixmemb=TRUE)

The prototypes and the memberships cannot both be fixed at the same time.

2.2.1 Display the best solution

The clustering result includes outputs such as the objective function value, the number of iterations and the computing time obtained with each start of the algorithm. The following code chunk displays these outputs.

res.fcm$func.val

## [1] 60.50571 60.50571 60.50571 60.50571 60.50571

res.fcm$iter

## [1] 46 51 47 48 44

res.fcm$comp.time

## [1] 0.486 0.510 0.469 0.512 0.442

Among the successive starts of the algorithm, the best solution is the one from the start giving the minimum value of the objective function. It is stored as the final clustering result of the multiple starts of FCM.

res.fcm$best.start

## [1] 1
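The selection rule described above can be sketched in base R. The sketch assumes that the best start is chosen by a rule equivalent to which.min. This is an assumption about the behavior, not a statement about the ‘ppclust’ source code. The vector func_val repeats the objective function values printed above, and hypothetical_val is made up for illustration.

```r
# Objective function values of the five starts, as printed above.
func_val <- c(60.50571, 60.50571, 60.50571, 60.50571, 60.50571)

# which.min() returns the first index attaining the minimum, so ties
# resolve to the earliest start.
best <- which.min(func_val)
best

# Hypothetical values for runs that converge to different solutions.
hypothetical_val <- c(61.2, 60.5, 60.9)
which.min(hypothetical_val)
```

Under this assumption, five starts that reach the same objective value report the first start as the best one, which agrees with the output above.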

2.2.2 Display the summary of clustering results

As described for the single start of FCM above, the result of multiple starts of FCM is displayed with the summary method of the ‘ppclust’ package.

summary(res.fcm)

## Summary for 'res.fcm'
## 
## Number of data objects:  150 
## 
## Number of clusters:  3 
## 
## Crisp clustering vector:
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
##  [75] 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 2 2 2 2 3 2 2 2 2
## [112] 2 2 3 2 2 2 2 2 3 2 3 2 3 2 2 3 3 2 2 2 2 2 3 2 2 2 2 3 2 2 2 3 2 2 2 3 2
## [149] 2 3
## 
## Initial cluster prototypes:
##           Sepal.Length Sepal.Width Petal.Length Petal.Width
## Cluster 1          5.6         3.0          4.1         1.3
## Cluster 2          6.3         3.3          6.0         2.5
## Cluster 3          5.4         3.9          1.3         0.4
## 
## Final cluster prototypes:
##           Sepal.Length Sepal.Width Petal.Length Petal.Width
## Cluster 1     5.003966    3.414089     1.482816   0.2535463
## Cluster 2     6.775011    3.052382     5.646782   2.0535467
## Cluster 3     5.888932    2.761069     4.363952   1.3973150
## 
## Distance between the final cluster prototypes
##           Cluster 1 Cluster 2
## Cluster 2 23.846049          
## Cluster 3 10.818752  2.946292
## 
## Difference between the initial and final cluster prototypes
##           Sepal.Length Sepal.Width Petal.Length Petal.Width
## Cluster 1   -0.5960340   0.4140889   -2.6171845  -1.0464537
## Cluster 2    0.4750112  -0.2476177   -0.3532182  -0.4464533
## Cluster 3    0.4889324  -1.1389306    3.0639516   0.9973150
## 
## Root Mean Squared Deviations (RMSD): 2.645823 
## Mean Absolute Deviation (MAD): 15.84692 
## 
## Membership degrees matrix (top and bottom 5 rows): 
##   Cluster 1   Cluster 2   Cluster 3
## 1 0.9966236 0.001072034 0.002304380
## 2 0.9758525 0.007497947 0.016649509
## 3 0.9798259 0.006414579 0.013759500
## 4 0.9674274 0.010107523 0.022465031
## 5 0.9944704 0.001767935 0.003761709
## ...
##      Cluster 1 Cluster 2 Cluster 3
## 146 0.01126223 0.8823507 0.1063871
## 147 0.02579593 0.4666788 0.5075252
## 148 0.01211367 0.8314467 0.1564396
## 149 0.02158126 0.7893823 0.1890364
## 150 0.02691888 0.3913000 0.5817811
## 
## Descriptive statistics for the membership degrees by clusters
##           Size       Min        Q1      Mean    Median        Q3       Max
## Cluster 1   50 0.8413450 0.9541261 0.9645018 0.9763228 0.9850474 0.9995473
## Cluster 2   40 0.5006317 0.7807561 0.8351480 0.8604619 0.9122633 0.9888134
## Cluster 3   60 0.5075252 0.6697398 0.7826035 0.7963157 0.9164202 0.9737972
## 
## Dunn's Fuzziness Coefficients:
## dunn_coeff normalized 
##  0.7833975  0.6750962 
## 
## Within cluster sum of squares by cluster:
##        1        2        3 
## 15.15100 27.05750 36.81767 
## (between_SS / total_SS =  88.04%) 
## 
## Available components: 
##  [1] "u"          "v"          "v0"         "d"          "x"         
##  [6] "cluster"    "csize"      "sumsqrs"    "k"          "m"         
## [11] "iter"       "best.start" "func.val"   "comp.time"  "inpargs"   
## [16] "algorithm"  "call"
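Compared with the single-start summary in Section 2.1.2.3, this summary shows the same final prototypes and cluster sizes (50, 40 and 60), with only the cluster numbers permuted. A cross-tabulation of the two crisp label vectors is one simple way to check that two partitions agree up to relabeling. The sketch below uses the labels of six objects copied from the two summaries. The names single_start, multi_start and same_partition are illustrative.

```r
# Crisp labels of objects 1, 2, 51, 52, 146 and 147, copied from the
# single-start and multiple-start summaries.
single_start <- c(3, 3, 1, 2, 1, 2)
multi_start  <- c(1, 1, 2, 3, 2, 3)

tab <- table(single_start, multi_start)
tab

# The partitions agree up to relabeling when every row and every column
# of the cross-tabulation has exactly one non-zero cell.
same_partition <- all(rowSums(tab > 0) == 1) && all(colSums(tab > 0) == 1)
same_partition
```

With full ‘ppclust’ results, the same check could be applied to the cluster components of the two objects.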

3 VISUALIZATION OF THE CLUSTERING RESULTS

3.1 Pairwise Scatter Plots

There are many ways to represent clustering results visually. One common technique is to display them using pairs of the features. The function ‘plotcluster’ can be used to plot the clustering results as follows:

plotcluster(res.fcm, cp=1, trans=TRUE)

In the plot above, the clustering result is shown with pairwise scatter plots using color palette 1 and transparent colors. For details, see the entry for the function ‘plotcluster’ in the package manual.

3.2 Cluster Plot with fviz_cluster

Some R packages offer attractive versions of cluster plots. One of them is the function fviz_cluster of the package ‘factoextra’ (Kassambara & Mundt, 2017). To use this function for a fuzzy clustering result, the fuzzy clustering object of ppclust must first be converted to a kmeans object with the function ppclust2 of the package ‘ppclust’, as shown in the first line of the following code chunk:

res.fcm2 <- ppclust2(res.fcm, "kmeans")
factoextra::fviz_cluster(res.fcm2, data = x, 
  ellipse.type = "convex",
  palette = "jco",
  repel = TRUE)

3.3 Cluster Plot with clusplot

Users can also use the function clusplot in the package ‘cluster’ (Maechler et al., 2017) to plot the clustering results. For this purpose, the fuzzy clustering object of ppclust should be converted to a fanny object with the ppclust2 function of the package ‘ppclust’, as seen in the following code chunk.

res.fcm3 <- ppclust2(res.fcm, "fanny")

cluster::clusplot(scale(x), res.fcm3$cluster,  
  main = "Cluster plot of Iris data set",
  color=TRUE, labels = 2, lines = 2, cex=1)

4 VALIDATION OF THE CLUSTERING RESULTS

Cluster validation is the process of evaluating the goodness of a clustering result. For this purpose, various validity indexes have been proposed in the literature. Since clustering is an unsupervised learning analysis that does not use any external information, internal indexes are used to validate the clustering results. Although many internal indexes were originally proposed for the hard membership degrees produced by K-means and its variants, most of them cannot be used for fuzzy clustering results. In the R environment, the Partition Entropy (PE), Partition Coefficient (PC), Modified Partition Coefficient (MPC) and Fuzzy Silhouette Index are available in the ‘fclust’ package (Ferraro & Giordani, 2015), and can be used as follows:

res.fcm4 <- ppclust2(res.fcm, "fclust")
idxsf <- SIL.F(res.fcm4$Xca, res.fcm4$U, alpha=1)
idxpe <- PE(res.fcm4$U)
idxpc <- PC(res.fcm4$U)
idxmpc <- MPC(res.fcm4$U)

cat("Partition Entropy: ", idxpe)

## Partition Entropy:  0.3954916

cat("Partition Coefficient: ", idxpc)

## Partition Coefficient:  0.7833975

cat("Modified Partition Coefficient: ", idxmpc)

## Modified Partition Coefficient:  0.6750962

cat("Fuzzy Silhouette Index: ", idxsf)

## Fuzzy Silhouette Index:  0.8091446
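To show what the first three indexes measure, the following base R sketch implements their textbook definitions and evaluates them on two extreme membership matrices. The function names are illustrative and are not the ‘fclust’ implementations. The use of the natural logarithm in the entropy is an assumption.

```r
# Textbook definitions of three fuzzy validity indexes (not the fclust code).
partition_coefficient <- function(u) sum(u^2) / nrow(u)

modified_partition_coefficient <- function(u) {
  k <- ncol(u)
  1 - k / (k - 1) * (1 - partition_coefficient(u))
}

partition_entropy <- function(u) {
  # Treat 0 * log(0) as 0; the natural logarithm is assumed here.
  terms <- ifelse(u > 0, u * log(u), 0)
  -sum(terms) / nrow(u)
}

# Two extreme cases for three objects and three clusters.
u_crisp   <- diag(3)                             # fully crisp memberships
u_uniform <- matrix(1 / 3, nrow = 3, ncol = 3)   # maximally fuzzy memberships

crisp_idx <- c(PC  = partition_coefficient(u_crisp),
               MPC = modified_partition_coefficient(u_crisp),
               PE  = partition_entropy(u_crisp))
uniform_idx <- c(PC  = partition_coefficient(u_uniform),
                 MPC = modified_partition_coefficient(u_uniform),
                 PE  = partition_entropy(u_uniform))
crisp_idx
uniform_idx
```

A crisp partition gives PC = 1, MPC = 1 and PE = 0. A uniform partition gives PC = 1/k, MPC = 0 and PE = log(k). Applying the MPC formula to the reported PC of 0.7833975 with k = 3 gives about 0.6751, which agrees with the MPC reported above and with the Dunn's fuzziness coefficients in the summary output.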

References

Anderson, E. (1935). The irises of the Gaspe Peninsula. Bull. Amer. Iris Soc., 59: 2-5.

Bezdek, J.C. (1974). Cluster validity with fuzzy sets. J. Cybern., 3: 58-73.

Bezdek, J.C. (1981). Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum, NY. https://isbnsearch.org/isbn/0306406713

Ferraro, M.B. & Giordani, P. (2015). A toolbox for fuzzy clustering using the R programming language. Fuzzy Sets and Systems, 279, 1-16. http://dx.doi.org/10.1016/j.fss.2015.05.001

Kassambara, A. & Mundt, F. (2017). factoextra: Extract and Visualize the Results of Multivariate Data Analyses. R package version 1.0.5. https://CRAN.R-project.org/package=factoextra

Maechler, M., Rousseeuw, P., Struyf, A., Hubert, M. & Hornik, K. (2017). cluster: Cluster Analysis Basics and Extensions. R package version 2.0.6. https://CRAN.R-project.org/package=cluster

Partitioning Cluster Analysis Using Fuzzy C-Means

Zeynel Cebeci

2017-11-20