in clustering, Data Science

DB Scan Clustering

Density-based spatial clustering of applications with noise (DBSCAN)

DSBCAN is a density-based clustering algorithm that groups nearest neighbors that satisfies these parameters and criteria

  • ε or EPS is a distance parameter that defines the radius to search for nearby neighbors. We can imagine each data point having a circle with radius EPS drawn around it.
  • Min Samples or Min Points are the number of minimum points that are within the EPS circle (or within EPS distance) from a point to form a dense region or cluster.
  • Algorithm first checks every point whether Min-Sample number of points are present within epsom distance around it to form a cluster. If so, those points form a cluster and become core points
  • After the first iteration, cluster members of a core point would have become core points themselves by forming a cluster of their own meeting the criteria. In the next iteration, clusters of directly reachable core points are merged and become a larger cluster. Here, there could be points which are part of a cluster, but do not form a cluster of theirs. Those points are caled 'edge' or 'reachable' points. Points which are not part of any cluster are termed 'outliers' or 'noise points'

In this diagram, minPts = 4. Point A and the other red points are core points, because the area surrounding these points in an ε radius contain at least 4 points (including the point itself). Because they are all reachable from one another, they form a single cluster. Points B and C are not core points, but are reachable from A (via other core points) and thus belong to the cluster as well. Point N is an outlier or noise point that is neither a core point nor directly-reachable.

Advantages

  • Does not require to specify the number of clusters as opposed to K-Means

  • Can find arbitrarily-shaped clusters. It can even find a cluster completely surrounded by (but not connected to) a different cluster. Due to the MinPts parameter, the so-called single-link effect (different clusters being connected by a thin line of points) is reduced. Contrast this with single linkage hierarchical clustering

  • Has a notion of noise, and is robust to outliers.

  • DBSCAN requires just two parameters and is mostly insensitive to the ordering of the points. (However, points sitting on the edge of two different clusters might swap cluster membership if the ordering of the points is changed.)

  • The parameters minPts and ε can be set by a domain expert, if the data is well understood.

Disadvantages

  • DBSCAN is not entirely deterministic. Border points that are reachable from more than one cluster can be part of either cluster, depending on the order the data are processed.

  • The quality depends on the distance measure used , like Euclidean. Distance measures suffer from 'Curse of Dimensionality', making it difficult to find an appropriate value for ε.

    As general rule, minPts selected as >=D+1, where D is number of dimensions(Features)

  • Cannot choose minPts-ε combination appropriately where clusters have large differences in densities

Additional Reading

Write a Comment

Comment