Clustering groups data points based on the similarity or dissimilarity among them. Similarity is defined over a set of properties of the data points, for example age, location, or salary. The measure of similarity is often expressed as a distance, such as Euclidean distance, Manhattan distance, cosine distance, or Pearson correlation. A clustering algorithm then takes this similarity measure and groups the data points into clusters.
- No predictor variables or output variables
- No labels
- Unsupervised
KMeans/KModes/KPrototypes and hierarchical clustering are common clustering methods. Clustering is the method behind segmentation in business terms, such as behavioral segmentation, attitudinal segmentation, or demographic segmentation.
Similarity Measures
Euclidean Distance
Geometric distance between two points
If two data points X and Y have n properties in the similarity measure, then the Euclidean distance between them is
X=(X_1,X_2,X_3,...,X_n)
Y=(Y_1,Y_2,Y_3,...,Y_n)
D = \sqrt{(X_1-Y_1)^2 + (X_2-Y_2)^2+...+(X_n-Y_n)^2}
Centroid
The centroid is the center point of a cluster: the per-property mean of its members' similarity measures. From the above example,
X=(X_1,X_2,X_3,...,X_n)
Y=(Y_1,Y_2,Y_3,...,Y_n)
\text{Centroid} =(\frac{X_1+Y_1}{2},\frac{X_2+Y_2}{2}, \frac{X_3+Y_3}{2},....,\frac{X_n+Y_n}{2} )
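Both formulas translate directly into a few lines of NumPy; the two 3-property points below are hypothetical:

```python
import numpy as np

# Two hypothetical points with n = 3 properties each
X = np.array([1.0, 2.0, 3.0])
Y = np.array([4.0, 6.0, 3.0])

# Euclidean distance: square root of the sum of squared differences
distance = np.sqrt(np.sum((X - Y) ** 2))  # equivalently np.linalg.norm(X - Y)
print(distance)  # 5.0

# Centroid: the per-property mean of the two points
centroid = (X + Y) / 2  # (2.5, 4.0, 3.0)
```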
KMeans Algorithm
1. Start with k random (or chosen) centroids and assign each element to the cluster whose centroid is nearest (assign step).
2. Recompute the centroid of each cluster from its assigned points and reassign elements by repeating step 1 (optimization step).
3. Continue until the centroids no longer change.
In the optimization step, KMeans tries to minimize the distance from members to the mean of their cluster. The cost function can be written as
J = \sum_{i=1}^{n}\left\| X_{i}-\mu _{k(i)} \right\|^2
Rearranging over the K clusters,
J = \sum_{k=1}^{K}\sum_{i\in C_k}\left\| X_{i}-\mu _{k} \right\|^2
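The assign/optimize loop above can be sketched from scratch (a minimal illustration with made-up blob data, not a production implementation; use `sklearn.cluster.KMeans` in practice):

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """Minimal KMeans sketch: assign step + optimization step,
    repeated until the centroids stop changing."""
    rng = np.random.default_rng(seed)
    # Start with k randomly chosen data points as the initial centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign step: each point joins the cluster of its nearest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Optimization step: recompute each centroid as the mean of its members
        new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):  # converged
            break
        centroids = new_centroids
    return centroids, labels

# Two well-separated blobs; the algorithm should recover them
pts = np.vstack([np.random.default_rng(1).normal(0, 0.1, (20, 2)),
                 np.random.default_rng(2).normal(5, 0.1, (20, 2))])
centroids, labels = kmeans(pts, k=2)
```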
K++ Algorithm
It is important to choose the initial centroids wisely so that the algorithm converges to a good solution. In K++ (K-Means++), initial centroids are chosen by an algorithm that tries to initialize centroids far apart from each other. Steps:
1. Select the first centroid randomly from the data points.
2. For each data point, compute its distance from the nearest previously chosen centroid.
3. Select the next centroid from the data points such that the probability of choosing a point as a centroid is directly proportional to its distance from the nearest previously chosen centroid (i.e. the point farthest from the nearest centroid is the most likely to be selected next).
4. Repeat steps 2 and 3 until k centroids have been sampled.
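The steps above can be sketched as follows (note: the standard k-means++ recipe samples proportionally to the *squared* distance to the nearest chosen centroid, which is what this sketch does; the demo data is made up):

```python
import numpy as np

def kmeans_pp_init(points, k, seed=0):
    """k-means++-style initialization: later centroids are sampled with
    probability proportional to squared distance from the nearest chosen one."""
    rng = np.random.default_rng(seed)
    # Step 1: pick the first centroid uniformly at random from the data
    centroids = [points[rng.integers(len(points))]]
    while len(centroids) < k:
        # Step 2: squared distance from each point to its nearest chosen centroid
        d2 = np.min([np.sum((points - c) ** 2, axis=1) for c in centroids], axis=0)
        # Step 3: sample the next centroid with probability proportional to d2
        centroids.append(points[rng.choice(len(points), p=d2 / d2.sum())])
    return np.array(centroids)

# Two tiny, well-separated groups; the initial centroids should land far apart
pts = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
init = kmeans_pp_init(pts, k=2)
```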
Important Considerations for KMeans Clustering
- The number of required clusters, k, is decided outside the algorithm; it is an input to the algorithm.
- The algorithm may find different clusters if the initial centroids are different.
- The algorithm is sensitive to outliers (both distance and mean are affected by them).
- Inputs need to be scaled to the same scale; non-standardized data can distort the distance measure.
- Categorical input does not work with KMeans, because a mean does not make sense for categorical variables. Use KModes instead.
- Convergence may take many iterations and may not happen at all. A limit on the number of iterations should be set when running the algorithm.
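To see why scaling matters, compare Euclidean distances on hypothetical (age, salary) data before and after standardization; the raw salary column dominates the distance completely:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical customers described by (age, salary)
data = np.array([[25.0, 50_000.0],
                 [55.0, 51_000.0],
                 [27.0, 90_000.0]])

# Unscaled: a 30-year age gap barely registers next to a 1000-unit salary gap
d_unscaled = np.linalg.norm(data[0] - data[1])

# Standardized: each column gets mean 0 and unit variance, so age contributes too
scaled = StandardScaler().fit_transform(data)
d_scaled = np.linalg.norm(scaled[0] - scaled[1])
```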
Cluster Tendency and Hopkins Statistics
Before applying the KMeans algorithm, it is prudent to check whether the data actually has meaningful clusters in it. The metric that evaluates this is called cluster tendency, and the Hopkins statistic is used to measure it. Interpretation of the Hopkins statistic:
- 0 to 0.3: the dataset is regularly spaced
- around 0.5 (0.3 to 0.7): the dataset is random
- 0.7 to 1: the dataset is suitable for clustering
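A sketch of one common formulation of the Hopkins statistic (several variants exist; the blob data in the demo is made up):

```python
import numpy as np

def hopkins(X, m=None, seed=0):
    """Compare nearest-neighbour distances of m uniform random points (u)
    against m sampled real points (w): H = sum(u) / (sum(u) + sum(w)).
    H near 0.5 -> random data; H near 1 -> clusterable data."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    m = m or max(1, n // 10)
    # u: distance from each uniform random point to its nearest real point
    uniform = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, d))
    u = [np.linalg.norm(X - p, axis=1).min() for p in uniform]
    # w: distance from each sampled real point to its nearest other real point
    w = [np.linalg.norm(np.delete(X, i, axis=0) - X[i], axis=1).min()
         for i in rng.choice(n, size=m, replace=False)]
    return sum(u) / (sum(u) + sum(w))

# Two tight, far-apart blobs should score well above 0.7
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.05, (50, 2)), rng.normal(10, 0.05, (50, 2))])
h = hopkins(X)
```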
KMeans Analysis in Python
KMeans is very popular for segmentation analyses such as RFM (Recency, Frequency, Monetary value) of retail customers.
Prepare Data
Create RFM variables
```python
# Monetary value: total amount spent per customer
amt_df = df.groupby('customer_id')['purchase_amount'].sum().reset_index()

# Frequency: number of purchases per customer
frq_df = df.groupby('customer_id')['purchase_amount'].count().rename('frequency').reset_index()

# Recency: days since the customer's last purchase
last_purchase_date_in_dataset = df['purchase_date'].max()
df['days_since_last_purchase'] = last_purchase_date_in_dataset - df['purchase_date']
recency_df = df.groupby('customer_id')['days_since_last_purchase'].min().dt.days.rename('recency').reset_index()

# Merge all three on customer_id
rfm_df = amt_df.merge(frq_df, on='customer_id').merge(recency_df, on='customer_id')
```
Remove Outliers
```python
# IQR-style filter on purchase amount
Q1 = rfm_df['purchase_amount'].quantile(0.5)
Q3 = rfm_df['purchase_amount'].quantile(0.95)
IQR = Q3 - Q1
MIN = Q1 - 1.5 * IQR
MAX = Q3 + 1.5 * IQR
rfm_df = rfm_df[(rfm_df['purchase_amount'] >= MIN) & (rfm_df['purchase_amount'] <= MAX)]
```
Scale Data
```python
from sklearn.preprocessing import StandardScaler

# Scale only the numeric RFM columns; keep customer_id aside for later use
customer_ids = rfm_df['customer_id']
scaler = StandardScaler()
scaled_data = scaler.fit_transform(rfm_df[['purchase_amount', 'frequency', 'recency']])
rfm_df = pd.DataFrame(scaled_data, columns=['amount', 'frequency', 'recency'])
rfm_df['customer_id'] = customer_ids.values
```
Evaluation and Choosing K
Elbow Method: Sum of Squared Distances (SSD) Analysis

Inertia, or SSD, is calculated as
\sum_{i=0}^n(X_{i}-\overline{X})^2
where X_i is a point in the cluster and \overline{X} is the centroid.

Distortion is calculated as the average of the squared distances from the points to their respective cluster centers.
Fit KMeans for different cluster counts, plot the SSD of each, and then decide how many clusters are optimal:
```python
from sklearn.cluster import KMeans

ssd = []
different_cluster_counts = [2, 3, 4, 5, 6, 7, 8]
for cluster_count in different_cluster_counts:
    kmeans = KMeans(n_clusters=cluster_count, max_iter=50).fit(rfm_df[['amount', 'frequency', 'recency']])
    ssd.append(kmeans.inertia_)
```
![](/home/seegler/git/jupyter/pgddsiitb/course4/3clusteriung/assets/Screenshot from 20200817 124219.png)
SSD decreases as the number of clusters increases. However, the rate of decrease flattens as the cluster count grows, and too many clusters also doesn't make business sense. Select a size at the elbow that is acceptable.
Silhouette Score
The silhouette value is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation). It can be used to study the separation distance between the resulting clusters.
\text{silhouette score}=\frac{p-q}{\max(p,q)}
where p is the mean distance to the points in the nearest cluster that the data point is not a part of (separation), and q is the mean intra-cluster distance to all the points in its own cluster (cohesion).
- The score ranges from -1 to 1.
- A score near 1 means p is high compared to q: the point is well separated from other clusters, which is desirable.
- A score near -1 means q is large compared to p: the intra-cluster distance is too high and the point doesn't belong in its current cluster.
```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = rfm_df[['amount', 'frequency', 'recency']]
different_cluster_counts = [2, 3, 4, 5, 6, 7, 8]
for cluster_count in different_cluster_counts:
    kmeans = KMeans(n_clusters=cluster_count, max_iter=50).fit(X)
    print(f"clusters: {cluster_count}, silhouette_score: {silhouette_score(X, kmeans.labels_)}")
```
Choose the cluster count that gives the maximum, or at least a high, silhouette score.
Assign cluster labels to data points
For the cluster count K chosen by the methods above, run KMeans and find the cluster labels. Cluster labels indicate which cluster each data point belongs to.
```python
kmeans = KMeans(n_clusters=5, max_iter=50).fit(rfm_df[['amount', 'frequency', 'recency']])
rfm_df['cluster_label'] = kmeans.labels_
print(rfm_df[['customer_id', 'cluster_label']])
```
Boxplot R, F, and M for each cluster and evaluate
Boxplots show the separation of clusters on the R, F, and M properties and help with their business interpretation.
```python
import seaborn as sns

sns.boxplot(x='cluster_label', y='amount', data=rfm_df)
sns.boxplot(x='cluster_label', y='frequency', data=rfm_df)
sns.boxplot(x='cluster_label', y='recency', data=rfm_df)
```
A box plot may contain outliers that make the boxes too tiny. In that case, remove the outliers from the dataset, rerun the algorithm, and plot again.