# DBSCAN Clustering

Density-based spatial clustering of applications with noise (DBSCAN)

DBSCAN is a density-based clustering algorithm that groups nearby neighbors into clusters according to the following parameters and criteria:

• ε or eps is a distance parameter that defines the radius to search for nearby neighbors. We can imagine each data point having a circle of radius eps drawn around it.
• Min Samples (or Min Points) is the minimum number of points that must lie within the eps circle (i.e. within eps distance) of a point to form a dense region or cluster.
• The algorithm first checks, for every point, whether at least Min Samples points are present within ε distance of it. If so, those points form a cluster, and such points become core points.
• After the first iteration, cluster members of a core point may have become core points themselves by forming clusters of their own that meet the criteria. In the next iteration, clusters of directly reachable core points are merged into a larger cluster. There can be points which are part of a cluster but do not form a cluster of their own; those points are called 'edge' or 'reachable' points. Points which are not part of any cluster are termed 'outliers' or 'noise points'.

In the standard illustration of DBSCAN (with minPts = 4), point A and the other red points are core points, because the area surrounding each of them within an ε radius contains at least 4 points (including the point itself). Because they are all reachable from one another, they form a single cluster. Points B and C are not core points, but are reachable from A (via other core points) and thus belong to the cluster as well. Point N is an outlier or noise point that is neither a core point nor reachable from one.

• Does not require the number of clusters to be specified in advance, as opposed to K-Means

• Can find arbitrarily-shaped clusters. It can even find a cluster completely surrounded by (but not connected to) a different cluster. Due to the MinPts parameter, the so-called single-link effect (different clusters being connected by a thin line of points) is reduced. Contrast this with single linkage hierarchical clustering

• Has a notion of noise, and is robust to outliers.

• DBSCAN requires just two parameters and is mostly insensitive to the ordering of the points. (However, points sitting on the edge of two different clusters might swap cluster membership if the ordering of the points is changed.)

• The parameters minPts and ε can be set by a domain expert, if the data is well understood.

• DBSCAN is not entirely deterministic. Border points that are reachable from more than one cluster can be part of either cluster, depending on the order the data are processed.

• The quality depends on the distance measure used, e.g. Euclidean. Distance measures suffer from the 'curse of dimensionality', making it difficult to find an appropriate value for ε.

As a general rule, minPts is selected as >= D + 1, where D is the number of dimensions (features)

• An appropriate minPts-ε combination cannot be chosen when clusters have large differences in density
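As a quick illustration, here is a minimal scikit-learn sketch; the eps and min_samples values are illustrative choices for the two-moons toy dataset, not recommendations.

```python
# Minimal DBSCAN sketch with scikit-learn on the two-moons toy dataset.
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)
db = DBSCAN(eps=0.2, min_samples=5).fit(X)

labels = db.labels_                                   # -1 marks noise points
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters found:", n_clusters)                  # the two moons separate cleanly
```

Note that no cluster count is passed in: DBSCAN discovers the two arbitrarily-shaped moon clusters from eps and min_samples alone.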

# Revise Conditional Probability and Bayes Theorem

Conditional probability of event A given event B: P(A\mid B) = \frac{P(A\cap B)}{P(B)}

Bayes Theorem

\begin{aligned}
P(A \mid B) &= \frac{P(A\cap B)}{P(B)}, \text{ if } P(B) \neq 0, \\
P(B \mid A) &= \frac{P(B\cap A)}{P(A)}, \text{ if } P(A) \neq 0, \\
\Rightarrow P(A\cap B) &=  P(A\mid B)\times P(B)=P(B\mid A)\times P(A), \\
\Rightarrow P(A \mid B) &=  \frac{P(B \mid A)  \times P(A)} {P(B)}, \text{ if } P(B) \neq 0.
\end{aligned}

A couple has two children, the older of which is a boy. What is the probability that they have two boys?

\begin{aligned}
B &= \text{older one is a boy} \\
A &= \text{two boys} \\
\\
P(A \mid B) &= \frac{P(B \mid A)\, P(A)}{P(B)} \\
&= \frac{P(\text{older is a boy} \mid \text{two boys}) \times P(\text{two boys})}{P(\text{older is a boy})} \\
P(\text{older is a boy} \mid \text{two boys}) &= 1 \quad \text{(given)} \\
P(A \mid B) &= \frac{1 \times \frac{1}{2} \times \frac{1}{2}}{\frac{1}{2}} = \frac{1}{2}
\end{aligned}

A couple has two children, one of which is a boy. What is the probability that they have two boys?

\begin{aligned}
A &= \text{having two boys}\\
B &= \text{one child is a boy}\\
\\
P(\text{one child is a boy}) &= 1 - P(\text{neither child is a boy})\\
&= 1 - \frac{1}{4} = \frac{3}{4}\\
\\
P(\text{two boys} \mid \text{one child is a boy}) &= \frac{P(\text{one child is a boy} \mid \text{two boys}) \times P(\text{two boys})}{P(\text{one child is a boy})} \\
&= \frac{1 \times \frac{1}{4}}{\frac{3}{4}} = \frac{1}{3}
\end{aligned}

A family has two children. Given that one of the children is a boy, and that he was born on a Tuesday, what is the probability that both children are boys?

\begin{aligned}
P(BB) &= \text{probability that both children are boys} \\
P(BT) &= \text{probability of having a boy born on Tuesday} \\
\\
P(BB \mid BT) &= \frac{P(BT \mid BB) \times P(BB)}{P(BT)} \\
P(BT \mid BB) &= P(\text{at least one boy born on Tuesday} \mid \text{both are boys}) \\
&= 1 - P(\text{neither boy born on Tuesday}) \\
&= 1 - \frac{6}{7} \times \frac{6}{7} = 1 - \frac{36}{49} = \frac{13}{49}\\
\\
P(BB) &= \frac{1}{2} \times \frac{1}{2} = \frac{1}{4} \\
\\
P(BT) &= P(\text{at least one child is a boy born on Tuesday}) \\
&= 1 - P(\text{neither child is a boy born on Tuesday}) \\
&= 1 - \frac{13}{14} \times \frac{13}{14} = 1 - \frac{169}{196} = \frac{27}{196}\\
\\
P(BB \mid BT) &= \frac{P(BT \mid BB) \times P(BB)}{P(BT)} \\
&= \frac{13}{49} \times \frac{1}{4} \times \frac{196}{27} = \frac{13}{27}
\end{aligned}
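The Tuesday-boy answer can also be checked by brute force: enumerate the 196 equally likely (sex, weekday) combinations for the two children and condition on the event, using only the standard library.

```python
# Brute-force check of the Tuesday-boy problem via enumeration.
from fractions import Fraction
from itertools import product

child = list(product('BG', range(7)))        # sex x day of week (day 1 = Tuesday, say)
families = list(product(child, repeat=2))    # (older, younger) pairs, 196 total

tuesday_boy = [f for f in families if ('B', 1) in f]
both_boys = [f for f in tuesday_boy if f[0][0] == 'B' and f[1][0] == 'B']

print(Fraction(len(both_boys), len(tuesday_boy)))  # 13/27
```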

Balls numbered 1 through 20 are placed in a bag. Three balls are drawn out of the bag without replacement. What is the probability that all the balls have odd numbers on them?

Three balls are drawn without replacement; 10 of the 20 balls are odd-numbered.

Total probability = P(1st ball odd) × P(2nd ball odd) × P(3rd ball odd)

= \frac{10}{20} \times \frac{9}{19} \times \frac{8}{18} = \frac{2}{19}
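The same answer follows from combinations, as a quick check:

```python
# Check via combinations: choose 3 of the 10 odd balls, out of C(20, 3) possible draws.
from fractions import Fraction
from math import comb

p = Fraction(comb(10, 3), comb(20, 3))
print(p)  # 2/19
```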

# Revise Permutations and Combinations

In how many possible ways can you write 1800 as a product of 3 positive integers a, b, c ?

• Prime factorization of 1800 = 2^3 * 3^2 * 5^2

• Distribute the powers of each prime factor among the digits a, b, c. For example:

If we distribute 2^3 as (2, 1, 0) and, let's say, a, b, c with the powers of the other prime factors 3 and 5 are (1, 9, 25), then, with the powers of 2 distributed as (2, 1, 0) (total 3), this becomes

(2^2 * 1, 2^1 * 9, 2^0 * 25) = (4, 18, 25), and 4 * 18 * 25 = 1800

Take another example, distributing 2^3 as (0, 3, 0):

(2^0 * 1, 2^3 * 9, 2^0 * 25) = (1, 72, 25), and 1 * 72 * 25 = 1800

2^3 can be distributed among the 3 numbers (a, b, c) in the following ways (similarly for the powers of the other prime factors; together the product is 1800):

3,0,0
0,3,0
0,0,3
2,1,0
2,0,1
1,2,0
0,2,1
1,0,2
0,1,2
1,1,1

(10 ways in total)


So the problem becomes distributing a power r (3 here, for prime factor 2) among n (3 here) numbers.

• Ways to distribute r items among n people:
D_{(r,n)}=\frac{(r+(n-1))!}{r!\,(n-1)!}

This sequence of combinations D_{(r,n)} can be found in Pascal's triangle.

Answer is D_{(3,3)} * D_{(2,3)} * D_{(2,3)} = 10 * 6 * 6

Ans: 360
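The 360 answer can be confirmed by a brute-force count of the ordered triples:

```python
# Brute-force check: count ordered triples (a, b, c) of positive integers
# with a * b * c == 1800, and compare with the stars-and-bars product.
from math import comb

n = 1800
count = 0
for a in range(1, n + 1):
    if n % a:
        continue
    for b in range(1, n // a + 1):
        if (n // a) % b == 0:        # c = n // (a * b) is then forced
            count += 1

print(count)  # 360
# Stars and bars over 2^3 * 3^2 * 5^2: D(3,3) * D(2,3) * D(2,3)
assert count == comb(5, 2) * comb(4, 2) * comb(4, 2)
```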

How many ways can the letters in the word ARMOUR be arranged?

n = 6

Repeating letters: R occurs 2 times. Arrangements that only swap the two R's are indistinguishable, so they must be removed from the total permutations.

So net permutations = \frac{6!}{2!} = 720 / 2 = 360

Out of 7 consonants and 4 vowels, how many words of 3 consonants and 2 vowels can be formed ?

Number of ways 3 consonants and 2 vowels can be selected:

Combinations are used here because the order of selection is not important. However, order does matter when forming words with the selected letters.

=\frac{7!}{(7-3)!\,3!} \times \frac{4!}{(4-2)!\,2!} = 35 \times 6 = 210

Permutations of the 5 selected letters (3 consonants and 2 vowels) = 5! = 120

So total words = 120 × 210 = \underline{25200}
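The arithmetic above can be checked in a couple of lines:

```python
# Select 3 of 7 consonants and 2 of 4 vowels, then arrange the 5 chosen letters.
from math import comb, factorial

words = comb(7, 3) * comb(4, 2) * factorial(5)
print(words)  # 25200
```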

How many 3-digit numbers can be formed from the digits 2, 3, 5, 6, 7 and 9, which are divisible by 5 and none of the digits is repeated ?

Divisible by 5 means last digit is 5.

For the remaining 2 digits, total permutations without repetition:

= {^5\!P_2}

=\underline{20}

Evaluate the permutation {}^{59}P_3:

= 59 × 58 × 57 = 195054

How many words can be formed from the letters of the word 'AFTER', so that the vowels never come together ?

For 5 letters, total permutations = 5!

Consider A and E as one letter, i.e. 4 letters in total; permutations = 4!

AE can be arranged internally in 2! ways (AE and EA)

So total permutations with A and E together = 2 × 4!

So total permutations where A and E don't come together

= 5! - 2 × 4! = 120 - 48 = 72

A box contains 4 red, 3 white and 2 blue balls. Three balls are drawn at random. Find out the number of ways of selecting the balls of different colors ?

Since there are three colors and three draws, a different color must appear in each draw. Total number of ways the three different colors can be drawn in three draws = 4 × 3 × 2 = 24

In how many different ways the letters of the word "MATHEMATICS" can be arranged so that the vowels always come together ?

Total 11 letters

Vowels: A, E, A, I (A repeats)

Among the consonants, M and T each repeat

Treat the vowels AEAI as one letter V, since arrangements where they come together are asked for. With V as one letter there are 8 letters in total; arrangements = 8!, but M and T each repeat, so this becomes 8!/(2! × 2!) = 10080

Within V, the vowels AEAI can be arranged among themselves, still staying together, in 4!/2! = 12 ways (since A repeats)

Total permutations = \frac{8!}{2! \times 2!} \times \frac{4!}{2!} = 10080 \times 12 = 120960

A box contains 2 white balls, 3 black balls and 4 red balls. In how many ways can 3 balls be drawn from the box, if at least one black ball is to be included in the draw ?

Total combinations of 9 balls taken 3 at a time = {}^9C_3

Combinations where no black ball shows up (choosing from the 6 non-black balls) = {}^6C_3

Number of combinations where at least one black ball shows up = {}^9C_3 - {}^6C_3

= 84 - 20 = 64

How many groups of 6 persons can be formed from 8 men and 7 women ?

{}^{15}C_6 = 5005

8 men entered a lounge simultaneously. If each person shakes hand with the other, then find the total no. of handshakes?

A handshake is a group of 2.

Order doesn't matter: A shaking hands with B is the same as B shaking hands with A.

So {}^8C_2 = 28

In how many different ways, can the letters of the word 'INHALE' be arranged ?

INHALE has 6 letters, none repeating.

Total arrangements = 6!

In how many different ways, can the letters of the word 'BANKING' be arranged ?

BANKING has 7 letters, with N repeating twice

Total arrangements =

\frac{7!}{2!} = 2520

In a meeting between two countries, each country has 12 delegates. All the delegates of one country shake hands with all delegates of the other country. Find the number of handshakes possible ?

all × all = 12 × 12 = 144

How many straight lines can be drawn through 15 non-collinear points?

Each line is determined by a combination of 2 points, and since no three points are collinear, every pair gives a distinct line.

={}^{15}C_2

=105

There are 25 students in a class. Find the number of ways in which a committee of 3 students is to be formed?

Combination as order is not important

{}^{25}C_3 = 2300

In how many ways can 8 Indians, 4 Americans and 4 Englishmen be seated in a row so that all persons of the same nationality sit together?

8 Indians, 4 Americans, 4 Englishmen

Consider each nationality group as a single block and find the seating of the blocks

3 blocks, order important: {}^3P_3 = 3!

Within each block, the members can be arranged among themselves

Total = 3! × 8! × 4! × 4!

How many different words can be formed using all the letters of the word ALLAHABAD?
(a) When vowels occupy the even positions.
(b) Both L do not occur together.

9 letters

Repeats: A × 4, L × 2

(a) The only vowel is A, and there are exactly 4 even positions (2, 4, 6, 8), so the four A's fill the even positions in just one way. The remaining 5 letters L, L, H, B, D occupy the odd positions in \frac{5!}{2!} = 60 ways (L repeats). So 60 words.

(b) Total arrangements = \frac{9!}{4! \times 2!} = 7560. Treating both L's as a single letter gives \frac{8!}{4!} = 1680 arrangements where they occur together.

So arrangements where both L's do not occur together = 7560 - 1680 = 5880
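Both counts can be verified by brute force over the distinct permutations of ALLAHABAD (9 letters is small enough to enumerate):

```python
# Brute-force check of the ALLAHABAD arrangement counts.
from itertools import permutations

distinct = set(permutations('ALLAHABAD'))

total = len(distinct)                                      # 9!/(4! * 2!)
vowels_even = sum(all(p[i] == 'A' for i in (1, 3, 5, 7))   # indices 1,3,5,7 = positions 2,4,6,8
                  for p in distinct)
ls_apart = sum('LL' not in ''.join(p) for p in distinct)

print(total, vowels_even, ls_apart)  # 7560 60 5880
```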

# K-Means Clustering

Clustering is about grouping data points based on similarity or dissimilarity among them. Similarity is defined for a dataset in terms of a set of properties of the data points, for example age, location, salary. The measure of similarity is often expressed as a distance, like Euclidean distance, Manhattan distance, cosine distance, Pearson correlation, etc. A clustering algorithm then takes this similarity measure and groups the data points into clusters.

• No predictor variables or output variables
• No labels
• Unsupervised

K-Means/K-Modes/K-Prototypes and hierarchical clustering are common clustering methods. Clustering is the method behind segmentation in business terms, like behavioral segmentation, attitudinal segmentation or demographic segmentation

## Similarity Measures

### Euclidean Distance

Geometric distance between two points

If two data points X and Y have n properties in the similarity measure, then the Euclidean distance between them is

X=(X_1,X_2,X_3,...X_n)

Y=(Y_1,Y_2,Y_3,....Y_n)

D = \sqrt{( X_1-Y_1)^2 + (X_2-Y_2)^2+...(X_n-Y_n)^2}

### Centroid

The centroid is the center point of a cluster: the coordinate-wise mean of its members over the similarity-measure properties. From the above example,

X=(X_1,X_2,X_3,...X_n)

Y=(Y_1,Y_2,Y_3,....Y_n)

Centroid =(\frac{X_1+Y_1}{2},\frac{X_2+Y_2}{2}, \frac{X_3+Y_3}{2},....,\frac{X_n+Y_n}{2} )
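For instance, the centroid of the two points above is just the coordinate-wise mean (a NumPy sketch with illustrative values):

```python
# Centroid as the coordinate-wise mean of the cluster members.
import numpy as np

X = np.array([1.0, 2.0, 3.0])
Y = np.array([3.0, 4.0, 5.0])
centroid = np.vstack([X, Y]).mean(axis=0)
print(centroid)  # [2. 3. 4.]
```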

## K-Means Algorithm

• Start with k random (or chosen) centroids and assign each element to the cluster whose centroid is nearest (assign step)
• Recompute the centroid of each cluster from the assigned points and reassign elements by repeating step 1 (optimization step)
• Continue until the centroids don't change

In the optimization step, K-Means tries to minimize the distance from members to the mean of their cluster. The cost function can be written as

J = \sum_{i=1}^{n}||X_{i}-\mu _{k(i)}||^2

Rearranging over the K clusters,

J=\sum_{k=1}^{K}\sum_{i \in C_k}||X_{i}-\mu _{k}||^2
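The assign/optimize loop above can be sketched directly in NumPy (illustrative only; in practice use sklearn.cluster.KMeans):

```python
# Minimal K-Means sketch: assign points to the nearest centroid, then
# recompute centroids as cluster means, until the centroids stop moving.
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign step: label of the nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Optimization step: recompute each centroid as its cluster mean
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated blobs should come out as two clusters
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.1, (50, 2)), rng.normal(10, 0.1, (50, 2))])
labels, centroids = kmeans(X, k=2)
```

Note the sensitivity to initialization mentioned below: a different seed can yield different clusters.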

#### K-Means++ Algorithm

It is important to choose the initial centroids wisely so that the algorithm converges on a good solution. In K-Means++, the initial centroids are chosen by a procedure that tries to initialize centroids far apart from each other. Steps:

1. Select the first centroid randomly from the data points.
2. For each data point compute its distance from the nearest, previously chosen centroid.
3. Select the next centroid from the data points such that the probability of choosing a point as centroid is directly proportional to its distance from the nearest, previously chosen centroid. (i.e. the point having maximum distance from the nearest centroid is most likely to be selected next as a centroid)
4. Repeat steps 2 and 3 until k centroids have been sampled
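Steps 1-4 above can be sketched as follows (D² sampling; a NumPy illustration, not the exact scikit-learn implementation):

```python
# k-means++ seeding sketch: each next centroid is sampled with probability
# proportional to its squared distance from the nearest centroid chosen so far.
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]                  # step 1: random first centroid
    for _ in range(k - 1):
        # step 2: squared distance to the nearest already-chosen centroid
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        # step 3: sample the next centroid proportionally to that distance
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)                             # step 4: k centroids

X = np.array([[0, 0], [0, 1], [10, 10], [10, 11], [20, 0], [20, 1]], float)
print(kmeans_pp_init(X, 3))
```

Because an already-chosen centroid has distance 0 to itself, it can never be sampled again, so the k seeds are always distinct points.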

### Important Considerations for K-Means Clustering

• The number of required clusters, k, is decided outside the algorithm; it is an input to the algorithm
• The algorithm may find different clusters if the initial centroids are different
• The algorithm is sensitive to outliers (both distance and mean are affected)
• Inputs need to be scaled to the same scale; non-standardized data can distort the distance measure
• Categorical inputs do not work with K-Means, because a mean doesn't make sense for categorical variables. Use K-Modes instead
• Convergence may take many iterations, or may not happen at all. Set a limit on the number of iterations when running the algorithm

### Cluster Tendency and the Hopkins Statistic

Before applying the K-Means algorithm, it is prudent to check whether the data actually contains meaningful clusters. This property is called cluster tendency, and the Hopkins statistic is used to measure it. Interpretation of the Hopkins statistic:

• 0-0.3: the dataset is regularly spaced
• around 0.5 (0.3-0.7): the dataset is random
• 0.7-1: the dataset is suitable for clustering
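One common formulation of the Hopkins statistic can be sketched as follows (several variants exist; this one uses scikit-learn's NearestNeighbors, and the sampling fraction is an illustrative choice):

```python
# Hopkins statistic sketch: compare nearest-neighbour distances of uniform
# random points (u) against those of sampled real points (w).
import numpy as np
from sklearn.neighbors import NearestNeighbors

def hopkins(X, sample_frac=0.1, seed=0):
    rng = np.random.default_rng(seed)
    n = max(1, int(sample_frac * len(X)))
    nbrs = NearestNeighbors(n_neighbors=2).fit(X)

    # u: distance from uniform random points to their nearest real point
    U = rng.uniform(X.min(axis=0), X.max(axis=0), size=(n, X.shape[1]))
    u = nbrs.kneighbors(U, n_neighbors=1)[0].sum()

    # w: distance from sampled real points to their nearest *other* real point
    sample = X[rng.choice(len(X), n, replace=False)]
    w = nbrs.kneighbors(sample, n_neighbors=2)[0][:, 1].sum()

    return u / (u + w)   # near 1 => clustered, near 0.5 => random
```

On strongly clustered data, random probe points land far from the data while real points have very close neighbours, pushing the ratio toward 1.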

## K-Means Analysis in Python

K-Means is very popular for segmentation analysis, like RFM (Recency, Frequency, Monetary value) analysis of retail customers

#### Prepare Data

Create the RFM variables

```python
# Monetary value: total purchase amount per customer
amt_df = df.groupby('customer_id')['purchase_amount'].sum().rename('amount')

# Frequency: number of purchases per customer
frq_df = df.groupby('customer_id')['purchase_amount'].count().rename('frequency')

# Recency: days since last purchase
last_purchase_date_in_dataset = df['purchase_date'].max()
df['days_since_last_purchase'] = last_purchase_date_in_dataset - df['purchase_date']
recency_df = df.groupby('customer_id')['days_since_last_purchase'].min().dt.days.rename('recency')

# Merge all three series into one RFM frame on customer_id
rfm_df = pd.concat([amt_df, frq_df, recency_df], axis=1).reset_index()
```

#### Remove Outliers

```python
# Standard IQR rule on the amount column
Q1 = rfm_df['amount'].quantile(0.25)
Q3 = rfm_df['amount'].quantile(0.75)
IQR = Q3 - Q1
MIN = Q1 - 1.5 * IQR
MAX = Q3 + 1.5 * IQR
rfm_df = rfm_df[(rfm_df['amount'] >= MIN) & (rfm_df['amount'] <= MAX)]
```

#### Scale Data

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaled_data = scaler.fit_transform(rfm_df[['amount', 'frequency', 'recency']])
rfm_df[['amount', 'frequency', 'recency']] = scaled_data
```

### Evaluation and Choosing K

#### Elbow Method - Sum of Squared Distances (SSD) Analysis

• Inertia or SSD is calculated as

\sum_{i=0}^n(X_{i}-\overline{X})^2

X_i is a point in the cluster and \bar{X} is the centroid

• Distortion: It is calculated as the average of the squared distances from the cluster centers of the respective clusters.

Fit K-Means for a range of cluster counts, plot the SSD for each, and then decide how many clusters are optimal.

```python
from sklearn.cluster import KMeans

ssd = []
different_cluster_counts = [2, 3, 4, 5, 6, 7, 8]
for cluster_count in different_cluster_counts:
    kmeans = KMeans(n_clusters=cluster_count, max_iter=50).fit(rfm_df)
    ssd.append(kmeans.inertia_)
```

*(Plot: SSD vs. number of clusters, showing the elbow.)*

SSD decreases as the number of clusters increases. However, the rate of decrease flattens as the cluster count grows, and too many clusters also doesn't make business sense. Select a size at the elbow that is acceptable.

#### Silhouette Score

The silhouette value is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation). It can be used to study the separation distance between the resulting clusters.

\text{silhouette score}=\frac{p-q}{max(p,q)}

p is the mean distance to the points in the nearest cluster that the data point is not a part of (separation)

q is the mean intra-cluster distance to all the points in its own cluster (cohesion)

• The score ranges from -1 to 1
• The score approaches 1 when p is much larger than q, i.e. the point is far from the neighboring cluster, which is desirable
• The score approaches -1 when q is much larger than p, i.e. the point is on average closer to the neighboring cluster than to its own, so it probably doesn't belong in its current cluster

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

different_cluster_counts = [2, 3, 4, 5, 6, 7, 8]
for cluster_count in different_cluster_counts:
    kmeans = KMeans(n_clusters=cluster_count, max_iter=50).fit(rfm_df)
    print(f"clusters: {cluster_count}, silhouette_score: {silhouette_score(rfm_df, kmeans.labels_)}")
```

Choose a cluster count which gives the maximum, or at least a high, silhouette score

#### Assign cluster labels to data points

For the cluster count K chosen by the above methods, run K-Means and get the cluster labels. A cluster label indicates which cluster each data point belongs to.

```python
from sklearn.cluster import KMeans

cluster_labels = KMeans(n_clusters=5, max_iter=50).fit(rfm_df).labels_
rfm_df['cluster_label'] = cluster_labels
print(rfm_df[['customer_id', 'cluster_label']])
```

#### Boxplot R,F, M for each cluster and evaluate

Boxplots show the separation of clusters on the R, F and M properties and help their business interpretation

```python
import seaborn as sns

sns.boxplot(x='cluster_label', y='amount', data=rfm_df)
sns.boxplot(x='cluster_label', y='frequency', data=rfm_df)
sns.boxplot(x='cluster_label', y='recency', data=rfm_df)
```

A box plot may contain outliers that make the boxes too tiny. In such a case, remove the outliers from the dataset, rerun the algorithm, and then plot again.

# Right Charts for Visualization

Using the right chart to support the narrative is very important for conveying the insights of a data analysis to the stakeholders. A chart is often chosen based on what it should convey: a relationship, distribution, comparison or composition of the data. The type, size and structure of the data also play a crucial role in deciding which chart to use. Factors like time and interactivity add to the mix. Here is a brief survey of the thought process behind choosing the right chart.

Right Charts for Visualization PDF