Clustering problem on materialized data

Cluster Analysis

Cluster analysis is (yet another) data mining problem
Cluster analysis divides the data into groups (clusters) that are meaningful (usually based on some similarity criterion)

Important fact:

Cluster Analysis groups the data objects based only on information found in the data itself.
Cluster analysis studies each data object and its relationship with the other data objects
It uses a distance function to quantify how close to (or far from) each other two data objects are
The algorithms operate without human interaction
For this reason, cluster analysis is often called unsupervised classification

Clustering: example
- Consider 2 properties of materials.
  The following is a graphical representation of the property values:
- Forming 2 clusters:
- Forming 4 clusters:

The Basic Clustering Algorithms

K-means and K-medians

User specifies a given number of clusters
The algorithm finds centroids (center points) that minimize the total distance between each centroid and the points in the same cluster as that centroid

Hierarchical Clustering

DBSCAN

This clustering algorithm uses density as the clustering measure
The number of clusters is determined automatically by the algorithm
Points in low-density regions are classified as noise and omitted

This course is not about data mining.
I will only discuss K-means (K-medians) and Hierarchical clustering because the stream clustering algorithm by Guha et al. is an extension of these algorithms.

Clustering Attributes

The data items usually have multiple attribute (property) values
Example: Food
Sample data:

In general:

The attributes (properties) considered in the clustering are: A₁, A₂, ..., A_n
I will use lower case letters to denote the value of the properties of a given item, e.g.:

Commonly used Distance functions in clustering

How similar (close) or different (far) two objects are is defined by a distance function on the clustering attributes

Commonly used distance functions:

Euclidean distance (2-norm):

The distance between two different items X and Y is given by:

   Dist(X,Y) = sqrt( (x₁ - y₁)² + (x₂ - y₂)² + ... + (x_n - y_n)² )

   Graphically in 2 dimensions:

      X
      |\
      | \
      |  \
      |   \   sqrt( (x₁ - y₁)² + (x₂ - y₂)² )
      |    \
      |     \
      +-----> Y

The resulting algorithm is called the K-means method
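
A minimal Python sketch of the Euclidean distance, assuming each item is given as a tuple of numeric attribute values (the function name and the sample items are illustrative only):

```python
import math

def euclidean(x, y):
    """2-norm distance between two items with the same number of attributes."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# Two hypothetical items with 2 attributes each
print(euclidean((1, 1), (4, 5)))   # sqrt(3² + 4²) = 5.0
```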

Manhattan distance (1-norm):

The distance between two different items X and Y is given by:

   Dist(X,Y) = |x₁ - y₁| + |x₂ - y₂| + ... + |x_n - y_n|

   Graphically in 2 dimensions:

         |x₁ - y₁|
      X ---------->
                  |
                  |  |x₂ - y₂|
                  |
                  V
                  Y

The resulting algorithm is called the K-medians method
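
The same kind of sketch for the Manhattan distance, again assuming items are tuples of numeric attribute values (names and sample items are illustrative only):

```python
def manhattan(x, y):
    """1-norm distance: sum of the absolute attribute differences."""
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

# Two hypothetical items with 2 attributes each
print(manhattan((1, 1), (4, 5)))   # |1 - 4| + |1 - 5| = 7
```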

Cosine (angle between the vectors):

The distance between two different items X and Y is given by:

   Dist(X,Y) = arccos( ( (x₁, x₂, ..., x_n) ⊗ (y₁, y₂, ..., y_n) ) / ( |X| |Y| ) )

   ⊗   = the inner product operator:  X ⊗ Y = x₁y₁ + x₂y₂ + ... + x_n y_n
   |X| = sqrt( x₁² + x₂² + ... + x_n² )
   |Y| = sqrt( y₁² + y₂² + ... + y_n² )

Graphically:

This measure is commonly used to measure difference in text documents
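
A small sketch of the angle-based distance, assuming both items are non-zero vectors (a zero vector has no direction, so the angle is undefined). The function name and the sample vectors are made up for illustration:

```python
import math

def cosine_angle(x, y):
    """Angle (in radians) between two non-zero attribute vectors."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    # Clamp to [-1, 1]: rounding can push the ratio slightly outside arccos's domain
    ratio = max(-1.0, min(1.0, dot / (norm_x * norm_y)))
    return math.acos(ratio)

# Hypothetical word-count vectors of two documents with no words in common
print(cosine_angle((3, 0), (0, 2)))   # a right angle: pi/2
```

Note that scaling a vector does not change the angle, so a document and a copy that is twice as long are at distance (approximately) 0.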

The triangular inequality

The triangular inequality is applicable to many distance functions:

1-norm:

x = (x₁, x₂, ..., x_n)
y = (y₁, y₂, ..., y_n)
Triangular inequality holds: |x + y|₁ ≤ |x|₁ + |y|₁, where |x|₁ = |x₁| + |x₂| + ... + |x_n|

And of course for the Euclidean norm:

x = (x₁, x₂, ..., x_n)
y = (y₁, y₂, ..., y_n)
Triangular inequality holds: |x + y| ≤ |x| + |y|, where |x| = sqrt( x₁² + x₂² + ... + x_n² )
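
In distance form, the triangular inequality says that Dist(A,C) ≤ Dist(A,B) + Dist(B,C) for any three items A, B and C. The following sketch simply checks this numerically for both norms on a handful of hypothetical points. It is a sanity check, not a proof:

```python
import itertools
import math

def manhattan(x, y):
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

def euclidean(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def satisfies_triangle(dist, pts):
    """Check Dist(a,c) <= Dist(a,b) + Dist(b,c) for every triple of points."""
    tolerance = 1e-12   # allow for floating-point rounding
    return all(dist(a, c) <= dist(a, b) + dist(b, c) + tolerance
               for a, b, c in itertools.product(pts, repeat=3))

points = [(1, 1), (2, 1), (4, 3), (5, 4)]   # hypothetical sample points
print(satisfies_triangle(manhattan, points))
print(satisfies_triangle(euclidean, points))
```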

Example K-means clustering algorithm

Problem Description:

Given the following properties of 4 medicines:

                Weight index    pH value
   Medicine A        1             1
   Medicine B        2             1
   Medicine C        4             3
   Medicine D        5             4

Graphical representation:

Problem:

Put the data points in 2 clusters (groups)

(Easy examples make things easier to understand :-))

Step 0: Pick initial centroids (picking good starting points is an art...)
Step 1: Find nearest centroid using Euclidean distance
Step 2: Recompute centroid for each group
New centroid (c₁, c₂, ..., c_n) for the cluster C is found through:
Example:
Result:
Repeat Step 1: Find nearest centroid using Euclidean distance
Repeat step 2: Recompute centroid for each group
Repeat Step 1: Find nearest centroid using Euclidean distance
Repeat step 2: Recompute centroid for each group
NO CHANGES
DONE
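
The steps above can be traced with a short Python sketch on the four medicines. ASSUMPTION: the initial centroids are not recoverable from these notes, so the sketch starts from Medicines A and B. The helper names are illustrative, and the sketch assumes no cluster ever becomes empty:

```python
import math

points = {"A": (1, 1), "B": (2, 1), "C": (4, 3), "D": (5, 4)}

def euclidean(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def assign(points, centroids):
    """Step 1: map each point to the index of its nearest centroid."""
    return {name: min(range(len(centroids)),
                      key=lambda k: euclidean(p, centroids[k]))
            for name, p in points.items()}

def recompute(points, membership, k):
    """Step 2: each new centroid is the attribute-wise mean of its members."""
    centroids = []
    for j in range(k):
        members = [points[name] for name, c in membership.items() if c == j]
        centroids.append(tuple(sum(col) / len(members) for col in zip(*members)))
    return centroids

# ASSUMPTION: Medicines A and B serve as the initial centroids (Step 0)
centroids = [points["A"], points["B"]]
while True:
    membership = assign(points, centroids)
    new_centroids = recompute(points, membership, len(centroids))
    print(membership, new_centroids)
    if new_centroids == centroids:      # NO CHANGES: done
        break
    centroids = new_centroids
```

With this starting point the loop runs three times and ends with A and B in one cluster (centroid (1.5, 1.0)) and C and D in the other (centroid (4.5, 3.5)).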

K-medians clustering
- When using the K-medians algorithm, step 1 is modified to: find the nearest centroid using Manhattan distance
- Step 2 (recompute centroid for each group) is unchanged !
  The new centroid (c₁, c₂, ..., c_n) for the cluster C is found through:

The K-means/K-medians clustering Algorithm

Algorithm:

   Select K points as initial centroids;

   repeat
   {
      Form K clusters by assigning each point to its nearest centroid;

      Recompute the centroid using the new membership of each cluster;
   }
   until (centroids do not change)

The cost of the clustering solution
- We can look at the clustering algorithm as a minimization problem
- The cost of the minimization problem is:
- The cost of a solution is thus defined as:
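
A hedged sketch of such a cost function. ASSUMPTION: the cost is taken to be the total distance from every data point to its nearest centroid, following the description of K-means earlier in these notes (the sum of squared distances is another common choice). The two centroid sets below are hypothetical, using the four medicine data points:

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def clustering_cost(points, centroids, dist=euclidean):
    """Total distance from every point to its nearest centroid."""
    return sum(min(dist(p, c) for c in centroids) for p in points)

points = [(1, 1), (2, 1), (4, 3), (5, 4)]
poor = [(1, 1), (2, 1)]             # hypothetical: both centroids in one corner
better = [(1.5, 1.0), (4.5, 3.5)]   # hypothetical: one centroid per group

print(clustering_cost(points, poor))
print(clustering_cost(points, better))   # lower cost = better solution
```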

Choosing initial centroids

Methods:

Random...

Farthest points:

Select the first centroid randomly
Select the point farthest from the first centroid as the second centroid
Select the point farthest from both centroids as the third centroid
And so on.

Finding the farthest points is computationally expensive
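
A minimal sketch of the farthest-points method. ASSUMPTION: "farthest from all centroids chosen so far" is interpreted as the point whose distance to its nearest chosen centroid is largest. The function name and the seed parameter are illustrative only:

```python
import math
import random

def euclidean(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def farthest_point_centroids(points, k, seed=0):
    """Pick k initial centroids from the data points."""
    rng = random.Random(seed)
    centroids = [rng.choice(points)]          # first centroid: random
    while len(centroids) < k:
        # Next centroid: the point farthest from its nearest chosen centroid.
        # Every pick scans all points against all chosen centroids.
        farthest = max(points,
                       key=lambda p: min(euclidean(p, c) for c in centroids))
        centroids.append(farthest)
    return centroids

points = [(1, 1), (2, 1), (4, 3), (5, 4)]
print(farthest_point_centroids(points, 2))
```

Each pick computes a distance from every data point to every centroid chosen so far, which is where the cost comes from.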

Time and Space complexity

Space requirement:

You need to store attribute values from:
- The data points
- The centroid points (a centroid need not be equal to any data point !)
Space requirement: O((m+K) n)
- m = number of data points
- K = number of centroid points
- n = number of attributes

Time requirement:

In each loop:
- Find distance of all attributes of each data point to each centroid point: O(m × K × n)
Total running time: O(I × K × m × n)
- I = number of iterations until convergence

Hierarchical Agglomerative Clustering (HAC) Algorithms

A second important class of clustering algorithms consists of the Hierarchical Agglomerative algorithms

Basic Hierarchical Agglomerative Algorithm:

   for ( each x ∈ input set )
   {
      C_x = { x };          // Each data point is in its own cluster
   }

   Compute the Proximity Matrix between every 2 clusters;

   repeat
   {
      Merge the closest 2 clusters;

      Update Proximity Matrix;
   }
   until ( stop condition );   // e.g., min. distance > MIN or number of clusters = k, etc.
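
A simplified Python sketch of this algorithm. ASSUMPTIONS: the proximity between two clusters is the smallest distance between any two of their members (one possible choice), the stop condition is "number of clusters = k", and the closest pair is recomputed in every pass instead of maintaining a proximity matrix. All names are illustrative:

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def min_distance(c1, c2):
    """ASSUMED proximity: smallest distance between members of two clusters."""
    return min(euclidean(p, q) for p in c1 for q in c2)

def hac(points, k, proximity=min_distance):
    clusters = [[p] for p in points]    # each data point is its own cluster
    while len(clusters) > k:            # stop condition: k clusters remain
        # Find the closest pair of clusters (indices i < j)
        pairs = ((i, j) for i in range(len(clusters))
                        for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda ij: proximity(clusters[ij[0]],
                                                   clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]   # merge the closest 2 clusters
        del clusters[j]
    return clusters

points = [(1, 1), (2, 1), (4, 3), (5, 4)]
print(hac(points, 2))
```

On these four points the first merge joins (1, 1) with (2, 1), and the second joins (4, 3) with (5, 4).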

Defining proximity between clusters
- The key operation of the HAC algorithm is computing the proximity between clusters
- Commonly used cluster distance measures:
Example Hierarchical Clustering
- Initial state:
- After one merge operation:
- After two merge operations: