It’s now time to explore a different domain of machine learning.
Having discussed in detail supervised machine learning algorithms such as linear regression, decision trees, XGBoost and neural networks, we now shift our focus to one of the most popular unsupervised machine learning algorithms: K-Means clustering.
In unsupervised learning, the data is not labeled, so there is no known target variable. We have a set of features and, instead of trying to predict something, our goal is to group similar observations together.
Clustering groups observations that have similar properties or characteristics. This helps us to unearth hidden patterns and structures in the data.
In a good clustering, observations within a cluster are more similar to each other than to observations in other clusters, and observations in different clusters are as dissimilar as possible.
The K-Means clustering algorithm is an iterative clustering algorithm that assigns each observation in a dataset to exactly one of K clusters, where K is a number we specify before running the algorithm.
The main objective of the K-Means algorithm is to minimize the sum of squared distances between the observations in each cluster and that cluster's centroid. The centroid of a cluster is the mean of all the observations in the cluster, computed feature by feature.
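To make the objective concrete, here is a small worked example in plain Python. It is only an illustrative sketch: the helper names `centroid` and `within_cluster_ss` and the toy points are made up for this example, and Euclidean distance is assumed.

```python
def centroid(points):
    """Mean of each coordinate across the points of one cluster."""
    n = len(points)
    dims = len(points[0])
    return [sum(p[d] for p in points) / n for d in range(dims)]


def within_cluster_ss(clusters):
    """Sum of squared Euclidean distances from each point to its cluster centroid."""
    total = 0.0
    for points in clusters:
        c = centroid(points)
        for p in points:
            total += sum((x - m) ** 2 for x, m in zip(p, c))
    return total


# Two toy clusters in two dimensions (made-up values).
clusters = [
    [(1.0, 1.0), (1.0, 3.0)],  # centroid is (1.0, 2.0)
    [(5.0, 5.0), (7.0, 5.0)],  # centroid is (6.0, 5.0)
]

print(centroid(clusters[0]))        # [1.0, 2.0]
print(within_cluster_ss(clusters))  # 4.0
```

Each point here lies exactly one unit from its centroid, so the four squared distances add up to 4.0. K-Means looks for the assignment of observations to clusters that makes this total as small as it can.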
Here we list the steps to demonstrate how the algorithm works behind the scenes:
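The standard iteration (often called Lloyd's algorithm) picks K initial centroids, assigns every observation to its nearest centroid, recomputes each centroid as the mean of its assigned observations, and repeats until the centroids stop moving. Below is a minimal NumPy sketch of that loop. It assumes Euclidean distance and random initialization from the data points; the function name `kmeans`, its parameters, and the toy data are hypothetical choices for illustration, and production libraries use more careful initialization.

```python
import numpy as np


def kmeans(X, k, n_iter=100, seed=0):
    """Minimal K-Means sketch: returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Step 1: use k distinct observations as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]

    for _ in range(n_iter):
        # Step 2: assign each observation to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)

        # Step 3: move each centroid to the mean of its assigned observations
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])

        # Step 4: stop once the centroids no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    # Final assignment against the centroids that are returned.
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1), centroids


# Toy data: two well-separated blobs (made-up values).
X = np.array([
    [1.0, 1.0], [1.3, 0.7], [0.8, 1.2],
    [5.0, 5.0], [5.2, 4.7], [4.8, 5.3],
])
labels, centroids = kmeans(X, k=2)
print(labels)
print(centroids)
```

The result depends on the initial centroids, so different seeds can give different cluster numbering and, on harder data, different clusters. That is why practical implementations usually run several initializations and keep the best one.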
We are going to implement the K-Means algorithm on the AI & Analytics Engine using the Iris data set. This data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica, and Iris versicolor). Four features were measured for each sample: the length and width of the sepals and petals. Although the data set is labeled, so the target variable is known, we are going to drop the class column and treat it as an unsupervised machine learning problem.
You can access the complete Jupyter notebook with the code here.
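For readers who want a quick look at the idea in code, here is a minimal sketch using scikit-learn. It is an assumption on our part that scikit-learn is a reasonable stand-in: this is not the code from the linked notebook and not how the Engine runs the algorithm internally. The values chosen for `n_init` and `random_state` are arbitrary.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# Load only the four feature columns; the species labels are ignored,
# which mirrors dropping the class column.
X = load_iris().data  # shape: (150, 4)

# K = 3 because we know there are three species; in a truly
# unsupervised setting K would have to be chosen some other way.
model = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = model.fit_predict(X)

print(model.cluster_centers_)  # one centroid per cluster, four features each
print(model.inertia_)          # sum of squared distances to the nearest centroid
```

The `inertia_` attribute is the quantity K-Means minimizes, so it can be compared across runs with different values of K.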
Our head of data science has created a short and sweet tutorial to give you a walk-through of how to implement the K-Means Clustering algorithm on the AI & Analytics Engine using the Iris data set.
You do not have to write even a single line of code for this. To learn more about getting started with no code data science, have a look at this article!
In this blog, we explored the K-Means clustering algorithm, one of the most intuitive and widely used algorithms in unsupervised machine learning. It is computationally efficient, and its results are easy to visualize. We then implemented the algorithm on the AI & Analytics Engine using the Iris data set, which took only a few minutes and did not require any coding.
Ready to give K-Means Clustering a try for yourself? Simply create a trial account with the Engine.