Partitioning methods, such as k-means clustering, require the user to specify the number of clusters to be generated.

• fviz_nbclust(): Determines and visualizes the optimal number of clusters using different methods: within-cluster sum of squares, average silhouette, and gap statistic.

• fviz_gap_stat(): Visualizes the gap statistic generated by the function clusGap() [in cluster package]. The optimal number of clusters is chosen according to the maxSE argument, which defaults to the "firstSEmax" method (see ?cluster::clusGap).

fviz_nbclust(x, FUNcluster = NULL, method = c("silhouette", "wss",
"gap_stat"), diss = NULL, k.max = 10, nboot = 100,
verbose = interactive(), barfill = "steelblue",
barcolor = "steelblue", linecolor = "steelblue",
print.summary = TRUE, ...)

fviz_gap_stat(gap_stat, linecolor = "steelblue", maxSE = list(method =
"firstSEmax", SE.factor = 1))

## Arguments

• x: numeric matrix or data frame. In the function fviz_nbclust(), x can be the results of the function NbClust().
• FUNcluster: a partitioning function which accepts as first argument a (data) matrix like x, second argument, say k, k >= 2, the number of clusters desired, and returns a list with a component named cluster which contains the grouping of observations. Allowed values include: kmeans, cluster::pam, cluster::clara, cluster::fanny, hcut, etc. This argument is not required when x is an output of the function NbClust().
• method: the method to be used for estimating the optimal number of clusters. Possible values are "silhouette" (for average silhouette width), "wss" (for total within sum of squares) and "gap_stat" (for the gap statistic).
• diss: dist object as produced by dist(), i.e.: diss = dist(x, method = "euclidean"). Used to compute the average silhouette width of clusters, the within sum of squares and hierarchical clustering. If NULL, dist(x) is computed with the default method = "euclidean".
• k.max: the maximum number of clusters to consider; must be at least two.
• nboot: integer, number of Monte Carlo ("bootstrap") samples. Used only for determining the number of clusters using the gap statistic.
• verbose: logical value. If TRUE, progress messages are printed.
• barfill, barcolor: fill color and outline color for bars.
• linecolor: color for lines.
• print.summary: logical value. If TRUE, the optimal number of clusters is printed in fviz_nbclust().
• ...: optionally further arguments for FUNcluster().
• gap_stat: an object of class "clusGap" returned by the function clusGap() [in cluster package].
• maxSE: a list containing the parameters (method and SE.factor) for determining the location of the maximum of the gap statistic (read the documentation ?cluster::maxSE). Allowed values for maxSE$method include:
  • "globalmax": simply corresponds to the global maximum, i.e., is which.max(gap).
  • "firstmax": gives the location of the first local maximum.
  • "Tibs2001SEmax": uses the criterion Tibshirani et al (2001) proposed: "the smallest k such that gap(k) >= gap(k+1) - s_{k+1}". It's also possible to use "the smallest k such that gap(k) >= gap(k+1) - SE.factor * s_{k+1}", where SE.factor is a numeric value which can be 1 (default), 2, 3, etc.
  • "firstSEmax": location of the first f() value which is not smaller than the first local maximum minus SE.factor * SE.f[], i.e., within an "f S.E." range of that maximum.
  See ?cluster::maxSE for more options.
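To make the maxSE rules more concrete, the sketch below applies cluster::maxSE() directly to the gap values and standard errors printed for the k-means run in the Examples section. The vectors gap and se are typed in by hand from that printed table; the names methods and k_hat are introduced here for illustration only.

```r
library(cluster)  # provides maxSE()

# Gap values and simulation standard errors for k = 1..10, copied by hand
# from the printed clusGap output in the Examples section
gap <- c(0.2185345, 0.4686203, 0.4907552, 0.4418565, 0.4606189,
         0.4476734, 0.4513843, 0.4480656, 0.4658256, 0.5081600)
se  <- c(0.03145767, 0.02397553, 0.03038244, 0.02263960, 0.02153819,
         0.02451182, 0.02816061, 0.02557573, 0.02313226, 0.02195875)

# Apply each selection rule to the same curve and compare the chosen k
methods <- c("globalmax", "firstmax", "Tibs2001SEmax", "firstSEmax")
k_hat <- sapply(methods, function(m) maxSE(gap, se, method = m, SE.factor = 1))
print(k_hat)
```

On these values "firstmax" selects 3, in line with the printed summary, while "globalmax" selects 10 because the gap curve peaks at the last k. The two SE-based rules tolerate a one-standard-error shortfall and can therefore settle on a smaller k, which is why the choice of maxSE$method matters when reading the plot.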

## Value

• fviz_nbclust, fviz_gap_stat: return a ggplot2 plot object.

## See also

fviz_cluster, eclust

## Examples

library(factoextra)
set.seed(123)

# Data preparation
# +++++++++++++++
data("iris")
head(iris)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          5.1         3.5          1.4         0.2  setosa
#> 2          4.9         3.0          1.4         0.2  setosa
#> 3          4.7         3.2          1.3         0.2  setosa
#> 4          4.6         3.1          1.5         0.2  setosa
#> 5          5.0         3.6          1.4         0.2  setosa
#> 6          5.4         3.9          1.7         0.4  setosa

# Remove species column (5) and scale the data
iris.scaled <- scale(iris[, -5])

# Optimal number of clusters in the data
# ++++++++++++++++++++++++++++++++++++++
# Examples are provided only for kmeans, but
# you can also use cluster::pam (for pam) or
#  hcut (for hierarchical clustering)

### Elbow method (look at the knee)
# Elbow method for kmeans
fviz_nbclust(iris.scaled, kmeans, method = "wss") +
  geom_vline(xintercept = 3, linetype = 2)
# Average silhouette for kmeans
fviz_nbclust(iris.scaled, kmeans, method = "silhouette")
### Gap statistic
library(cluster)
set.seed(123)
# Compute gap statistic for kmeans
# we used B = 10 for demo. Recommended value is ~500
gap_stat <- clusGap(iris.scaled, FUN = kmeans, nstart = 25,
                    K.max = 10, B = 10)
print(gap_stat, method = "firstmax")
#> Clustering Gap statistic ["clusGap"] from call:
#> clusGap(x = iris.scaled, FUNcluster = kmeans, K.max = 10, B = 10,     nstart = 25)
#> B=10 simulated reference sets, k = 1..10; spaceH0="scaledPCA"
#>  --> Number of clusters (method 'firstmax'): 3
#>           logW   E.logW       gap     SE.sim
#>  [1,] 4.534565 4.753100 0.2185345 0.03145767
#>  [2,] 4.021316 4.489937 0.4686203 0.02397553
#>  [3,] 3.806577 4.297333 0.4907552 0.03038244
#>  [4,] 3.699263 4.141120 0.4418565 0.02263960
#>  [5,] 3.589284 4.049903 0.4606189 0.02153819
#>  [6,] 3.519726 3.967399 0.4476734 0.02451182
#>  [7,] 3.448288 3.899672 0.4513843 0.02816061
#>  [8,] 3.398210 3.846276 0.4480656 0.02557573
#>  [9,] 3.334279 3.800104 0.4658256 0.02313226
#> [10,] 3.250246 3.758406 0.5081600 0.02195875
fviz_gap_stat(gap_stat)
# Gap statistic for hierarchical clustering
gap_stat <- clusGap(iris.scaled, FUN = hcut, K.max = 10, B = 10)
fviz_gap_stat(gap_stat)
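
As a rough illustration of what the "wss" and "silhouette" curves represent, the sketch below recomputes both quantities by hand for k-means on the scaled iris data, using only stats::kmeans() and cluster::silhouette(). It is not the internal implementation of fviz_nbclust(); the names x, d, ks, wss and avg_sil are introduced here for illustration, and setting the silhouette value at k = 1 to 0 is an assumption of this sketch, since the silhouette is undefined for a single cluster.

```r
library(cluster)  # provides silhouette()

set.seed(123)
x  <- scale(iris[, -5])  # same preparation as in the examples above
d  <- dist(x)            # Euclidean distances between observations
ks <- 1:10               # candidate numbers of clusters

# Total within-cluster sum of squares for each k (the elbow curve)
wss <- sapply(ks, function(k) {
  kmeans(x, centers = k, nstart = 25)$tot.withinss
})

# Average silhouette width for each k; set to 0 for k = 1 (assumption)
avg_sil <- sapply(ks, function(k) {
  if (k == 1) return(0)
  cl <- kmeans(x, centers = k, nstart = 25)$cluster
  mean(silhouette(cl, d)[, "sil_width"])
})

# One row per k: look for the bend in wss and the peak in avg_sil
print(data.frame(k = ks, wss = round(wss, 2), avg_sil = round(avg_sil, 3)))
```

The elbow method looks for the k after which wss stops dropping sharply, while the silhouette method takes the k with the largest average width, for example via which.max(avg_sil).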