Understanding K-means Clustering in Machine Learning

Before diving straight into studying the algorithm let us have some background about the algorithm. K-means clustering is a Machine Learning Algorithm. Precisely, machine learning algorithms are broadly categorized as supervised and unsupervised. Unsupervised learning is further classified as a transformation of the data set and clustering. Clustering further is of several types and K-means belong to hierarchical clustering.

Let us have an overview of these concepts before beginning to study the algorithm in detail.

What is Unsupervised Learning?

The machine is trained on unlabelled data without any guidance it should discover hidden patterns in the data. Unsupervised learning algorithms perform complex tasks but can be more doubtful as compared with the natural learning method. Unsupervised methods allow finding features that are useful for categorization purposes. Also, all unknown patterns can be found using unsupervised learning. The problems of unsupervised learning are categorized as clustering and association problems.

Check also the difference between unsupervised learning and supervised learning.

Let us now see what is clustering:

What is Clustering?

Let's consider a dataset of points:

dataset of points

We assume that it's possible to find a criterion (not unique) so that each sample can be associated with a specific group:

Conventionally, each group is called a cluster and the process of finding the function G is called clustering. Clustering is considered as an important concept to deal with the finding of a structure or a pattern in a bunch of unknown data. Clustering algorithms process the data and discover natural clusters (groups) if they are present in the data. It is up to to the user to adjust the number of clusters an algorithm should recognize as the algorithm gives the power to modify the granularity of the group.

There are various types of clustering you can use:

Partitioning: The data is organised such that a single data can be a part of one cluster only.
Agglomerative: Every data is a cluster in this technique.
Overlapping: In this technique, fuzzy sets are used to cluster data in this technique.
Probabilistic: Probability distribution is used in this technique to make the cluster.
Hierarchical: This algorithm makes a hierarchy of clusters. It begins with all the data allotted to the cluster of their own. Then two clusters are going to be in the same cluster the algorithm comes to end when just a single cluster is left.
K-mean Clustering: K refers to the dull clustering algorithm that helps to find the highest value for every problem. In this method of clustering, the required number of clusters is selected, the data points are clustered into k-groups. A bigger k-means smaller groups with more granularity whereas a smaller k means bigger groups with a reduced amount of granularity.

Let us now study the k-means clustering algorithm in detail:

K- Means Clustering Algorithm

The k-means algorithm is based on the initial condition to decide the number of clusters through the assignment of k initial centroids or means:

k initial centroids or means

Then the distance between each sample and each centroid is computed and the sample is assigned to the cluster where the distance is minimum. This approach is often called minimizing the inertia of the clusters, which is defined as follows:

Eq 10

The process is iterative, once all the samples have been processed, a new set of centroids K is computed and all the distances are recomputed. The algorithm stops when the desired tolerance is reached, or in other words, when the centroids become stable and, therefore, the inertia is minimized.

Algorithmic Steps

Let X = { x1, x2, x3, ……, xn} be the set of data points an

μ = {μ1, μ2, μ3,........,μn} be the centres.

Select “C” cluster centres randomly.
Calculate the distance between each data point and cluster centres.
The data point with minimum distance from the cluster centre is assigned to the cluster centre.
Rectangle the new cluster centre with the formula:

Eq 2

where ci represents the number of data points in the ith cluster

Recalculate the distance b/w new obtained cluster centres and data points.
Stop if no data point was reassigned, else repeat from step 3.

Sample Data Set Explaining K-means Clustering

Consider a simple example with a dummy dataset:

from sklearn.datasets import make_blobs
nb_samples = 1000
X, _ = make_blobs(n_samples=nb_samples, n_features=2, centers=3, cluster_std=1.5

In our example, we have three clusters with bidimensional features and a partial overlap due to the standard deviation of each blob. We haven’t use the variable here as we want to generate a set of locally coherent points to try our algorithms:

coherent points 1

In this case, we expect k-means to separate the three groups with minimum error in the X-region bounded between [-5, 0]. Hence, keeping the default values we get:

from sklearn.cluster import KMeans
>>> km = KMeans(n_clusters=3)
>>> km.fit(X)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
n_clusters=3, n_init=10, n_jobs=1, precompute_distances='auto', random_state=None, tol=0.0001, verbose=0)

>>> print(km.cluster_centers_) 

[[ 1.39014517, 1.38533993]
[ 9.78473454, 6.1946332 ]
[-5.47807472, 3.73913652]]

Reploting the data with three different markers, we verify how k-means successfully separated the data.

k-means

In this case, the separation is easy because k-means is based on euclidean distance, which is radial and so clusters are expected to be convex. The problem can not be solved using this algorithm if all of this doesn’t happen. Mostly, k-means can produce good results even if the convexity is not fully guaranteed, but there are several situations when the expected clustering is impossible and letting k-means finding out the centroid can lead to wrong solutions.

Let us also consider the case of concentric circles, scikit-learn provides a built-in function to generate such datasets:

from sklearn.datasets import make_circles
>>> nb_samples = 1000
>>> X, Y = make_circles(n_samples=nb_samples, noise=0.05)

The plot for concentric circles is shown:

concentric circles is shown

Here we have an internal cluster (blue triangle markers) and an external one (red dot markers). Such sets are not convex, and so it’s impossible for k-means to separate them correctly.

Suppose, we apply the algorithm to two clusters:

>>> km = KMeans(n_clusters=2)
>>> km.fit(X)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
n_clusters=2, n_init=10, n_jobs=1, precompute_distances='auto', random_state=None, tol=0.0001, verbose=0)

We get the separation as shown:

Graph

As expected, k-means converges on the two centroids in the middle of the two half-circles, and the resulting clustering is quite different.

K-means Clustering Algorithm Code in Python

df = pd.DataFrame({
    'x': [12, 20, 28, 18, 29, 33, 24, 45, 45, 52, 51, 52, 55, 53, 55, 61, 64, 69, 72],
    'y': [39, 36, 30, 52, 54, 46, 55, 59, 63, 70, 66, 63, 58, 23, 14, 8, 19, 7, 24]
})
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3)
kmeans.fit(df)
------------------------------------------------------------------------------labels = kmeans.predict(df)
centroids = kmeans.cluster_centers_
fig = plt.figure(figsize=(5, 5))

colors = map(lambda x: colmap[x+1], labels)

plt.scatter(df['x'], df['y'], color=colors, alpha=0.5, edgecolor='k')
for idx, centroid in enumerate(centroids):
    plt.scatter(*centroid, color=colmap[idx+1])
plt.xlim(0, 80)
plt.ylim(0, 80)
plt.show()
------------------------------------------------------------------------------

K-means Clustering Algorithm Code in R

# K-Means Algorithm 
#k=3 # the number of K
max=5000 # the maximum number for generating random points
n=100 # the number of points
maxIter = 10 # maximum number of iterations
threshold = 0.1 #difference of old means and new means
# Randomly generate points in the form of (x,y)
x <- sample(1:max, n)
y <- sample(1:max, n)
# put point into a matrix
z <- c(x,y)
m = matrix(z, ncol=2)
ks <- c(1,2,4,8,10,15,20) # different Ks
for(k in ks)
   myKmeans(m, k, max)
myKmeans <- function(m, k, max)
{
#initialization for k means: the k-first points in the list
x <- m[, 1]
y <- m[, 2]
d=matrix(data=NA, ncol=0, nrow=0)
for(i in 1:k)
    d <-  c(d, c(x[i], y[i]))
init <- matrix(d, ncol=2, byrow=TRUE)
dev.new()
plotTitle <- paste("K-Means Clustering K = ", k)
plot(m, xlim=c(1,max), ylim=c(1,max), xlab="X", ylab="Y", pch=20, 
        main=plotTitle)
par(new=T)
plot(init, pch=2, xlim=c(1,max), ylim=c(1,max), xlab="X", ylab="Y")
par(new=T)
oldMeans <- init
oldMeans 
cl <- Clustering(m, oldMeans)
cl
means <- UpdateMeans(m, cl, k)
thr <- delta(oldMeans, means)
itr <- 1
while(thr > threshold)
{
cl <- Clustering(m, means)
oldMeans <- means
means <- UpdateMeans(m, cl, k)
thr <- delta(oldMeans, means)
itr <- itr+1
}
cl
thr
means
itr
for(km in 1:k)
{
   group <- which(cl == km)
   plot(m[group,],axes=F, col=km, xlim=c(1,max), ylim=c(1,max), pch=20, xlab="X", ylab="Y")
   par(new=T)
}
plot(means, axes=F, pch=8, col=15, xlim=c(1,max), ylim=c(1,max), xlab="X", ylab="Y")
par(new=T)
dev.off()
} # end function myKmeans
#function distance
dist <- function(x,y)
{
 d<-sqrt( sum((x - y) **2 ))
}
createMeanMatrix <- function(d)
{
 matrix(d, ncol=2, byrow=TRUE)
}
# compute euclidean distance
euclid <- function(a,b){
 d<-sqrt(a**2 + b**2)
}
euclid2 <- function(a){
 d<-sqrt(sum(a**2))
}
#compute difference between new means and old means
delta <- function(oldMeans, newMeans)
{
 a <- newMeans - oldMeans
 max(euclid(a[, 1], a[, 2]))
}
Clustering <- function(m, means)
{
  clusters = c()
  n <- nrow(m)
  for(i in 1:n)
  {
    distances = c()
    k <- nrow(means)
    for(j in 1:k)
    {
 di <- m[i,] - means[j,]
 ds<-euclid2(di)
 distances <- c(distances, ds)
    }
    minDist <- min(distances)
    cl <- match(minDist, distances)
    clusters <- c(clusters, cl)    
  }
  return (clusters)
}
UpdateMeans <- function(m, cl, k)
{
 means <- c()
 for(c in 1:k)
 {
    # get the point of cluster c
    group <- which(cl == c)
    # compute the mean point of all points in cluster c
    mt1 <- mean(m[group,1])
    mt2 <- mean(m[group,2])
    vMean <- c(mt1, mt2)
    means <- c(means, vMean)
 }
 means <- createMeanMatrix(means)
 return(means)
}

Challenges of the K-means Clustering Algorithm

1. Different Cluster Size

The common challenge that the algorithm faces is different cluster sizes.

Let us understand this with an example:

Consider an original set of points as shown below:

K1-Original Points

In the original plot, the right and leftmost clusters are of smaller size as compared to the central cluster on applying k-means clustering on this algorithm, the result would be as shown:

K-1 2

2. Different Density of Data Points

Other challenges of the algorithm arise when the densities of the original points are different.

Consider again, a set of original points as shown:

K-2 1

In the plot above, the points in the blue and given clusters are closely packed whereas the points in the red cluster are spread out on applying k-means clustering on these points. We will get the cluster as shown.

K-2 2

We see that the compact points have been assigned to a single cluster, whereas the points that were spread out before and were in the same cluster are assigned to different clusters.

The solution could be using a higher number of clusters, so instead of three clusters (k=10) thus leading to the formation of meaningful clusters.

Applications of K-means Clustering Algorithm

1. Document Classification

This is a very standard classification problem and this algorithm is considered appropriate to solve it. Documents are clustered in multiple categories based on tags, content, topics of the document.

2. Customer Segmentation

Clustering Technique segment customers based on purchase history, interests or activity monitoring thus helping markets to improve their customer base, work on target areas. The classification would help the company target specific clusters of customers.

3. Insurance Fraud Detection

It is possible to isolate new claims by utilizing past historical data on fraudulent claims. Based on historical data clusters can be formed indicating fraudulent.

4. Call Record Data Analysis

CDR is the information captured by telecom companies and is used to understand the segment of customers with respect to their usage of hours.

The information collected via calls, SMS, and the internet provides greater insights about customers needs when used with the demographics of the customer.

5. Cyber Profiling Criminals

The idea of cyber profiling is derived from criminal profiles and in the process of data collection from individuals and groups to identify significant co-relations

Cyber profiling provides information on the investigation division to classify the types of criminals at the crime scene.

Advantages of K-means Clustering Algorithm

Easy to comprehend.
Robust and fast algorithm.
Efficient algorithm with the complexity O(tknd) where:

t: number of iterations.
k: number of centroids (clusters).
n: number of objects.
d: dimension of each object.
Usually, it is k, t, d << n.
It gives the best result when the data sets are distinct and well separated from each other.

Disadvantages of K-means Clustering Algorithm

The algorithm requires the Apriori specification of the number of cluster centres.
The k-means cannot resolve that there are two clusters if there are two highly overlapping data.
The algorithm is not invariant to non-linear transformations, i.e., different representations of data reveal different results.
Euclidean distance measures can unequally weigh underlying factors.
The algorithm fails for categorical data and is applicable only when the mean is defined.
Unable to handle noisy data and outliers.
The algorithm fails for a non-linear data set.

Machine Learning Intro for Python Developers

Conclusion

That brings us to the end of unsupervised learning algorithms, k-means clustering. We have studied the unsupervised technique that is a type of machine learning in which machines are trained using unlabelled data. Furthermore, we discussed clustering which in simpler words is the process of dividing the datasets into groups, consisting of similar data points. It has various uses, popular ones being Amazon’s recommendation system and Netflix’s movie recommendations. Moving on we learned about our main blog topic K-means clustering algorithm, its algorithmic steps and understood it using a dummy dataset. We also implemented the algorithm using its code in Python and R. Lastly, we studied about the challenges of the algorithm followed by its applications, advantages and disadvantages.

You may take an overview of more machine learning algorithms here.

Was this information helpful to you to understand this algorithm? Let us know your feedback!

People are also reading: