Credit Card Segmentation

Data Available:

  • CC GENERAL.csv

Business Context:

A bank wants to develop a customer segmentation to define its marketing strategy. The sample dataset summarizes the usage behaviour of about 9000 active credit card holders during the last 6 months. The file is at the customer level, with 18 behavioural variables.

Business Requirements:

Advanced data preparation: Build an enriched customer profile by deriving “intelligent” KPIs such as:

  • Monthly average purchase and cash advance amounts
  • Purchases by type (one-off, instalment)
  • Average amount per purchase and per cash-advance transaction
  • Limit usage (balance-to-credit-limit ratio)
  • Payments-to-minimum-payments ratio

(A pandas sketch of these KPIs follows the data dictionary below.)

Further requirements:

  • Advanced reporting: Use the derived KPIs to gain insight into the customer profiles.
  • Identification of the relationships/affinities between services.
  • Clustering: Apply a data reduction technique such as factor analysis for variable reduction, followed by a clustering algorithm, to reveal the behavioural segments of credit card holders.
  • Identify the characteristics of each cluster using detailed profiling.
  • Provide strategic insights and implementation strategies for the identified cluster characteristics.

Data Dictionary:

  • CUST_ID: Credit card holder ID
  • BALANCE: Monthly average balance (based on daily balance averages)
  • BALANCE_FREQUENCY: Ratio of last 12 months with balance
  • PURCHASES: Total purchase amount spent during last 12 months
  • ONEOFF_PURCHASES: Total amount of one-off purchases
  • INSTALLMENTS_PURCHASES: Total amount of installment purchases
  • CASH_ADVANCE: Total cash-advance amount
  • PURCHASES_FREQUENCY: Frequency of purchases (percentage of months with at least one purchase)
  • ONEOFF_PURCHASES_FREQUENCY: Frequency of one-off purchases
  • PURCHASES_INSTALLMENTS_FREQUENCY: Frequency of installment purchases
  • CASH_ADVANCE_FREQUENCY: Cash-advance frequency
  • CASH_ADVANCE_TRX: Number of cash-advance transactions
  • PURCHASES_TRX: Number of purchase transactions
  • CREDIT_LIMIT: Credit limit
  • PAYMENTS: Total payments (due amount paid by the customer to decrease their statement balance) in the period
  • MINIMUM_PAYMENTS: Total minimum payments due in the period.
  • PRC_FULL_PAYMENT: Percentage of months with full payment of the due statement balance
  • TENURE: Number of months as a customer
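
As a rough sketch, the KPIs listed in the business requirements can be derived with pandas, assuming the column names from the data dictionary above. The KPI names and the zero-guards are illustrative choices, not part of the original specification.

```python
import numpy as np
import pandas as pd

# Load the customer-level file described above.
df = pd.read_csv("CC GENERAL.csv")

kpi = pd.DataFrame(index=df.index)
# Monthly averages: spread totals over the customer's tenure in months.
kpi["monthly_avg_purchase"] = df["PURCHASES"] / df["TENURE"]
kpi["monthly_avg_cash_advance"] = df["CASH_ADVANCE"] / df["TENURE"]
# Average ticket size per transaction; guard against zero transaction counts.
kpi["avg_purchase_amount"] = df["PURCHASES"] / df["PURCHASES_TRX"].replace(0, np.nan)
kpi["avg_cash_advance_amount"] = df["CASH_ADVANCE"] / df["CASH_ADVANCE_TRX"].replace(0, np.nan)
# Limit usage: balance-to-credit-limit ratio.
kpi["limit_usage"] = df["BALANCE"] / df["CREDIT_LIMIT"]
# Payments-to-minimum-payments ratio.
kpi["pay_to_min_pay"] = df["PAYMENTS"] / df["MINIMUM_PAYMENTS"].replace(0, np.nan)

print(kpi.describe())
```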

Let’s develop a machine learning model for further analysis.

Clustering Theory

What is Clustering?

Clustering is one of the most popular unsupervised learning techniques, in which a population or set of data points is grouped based on the similarity of the data points.

Let’s understand this with an example. Suppose you are the head of a general store and you want to understand the preferences of your customers in order to scale up your business. It is not possible for you to look at the details of each customer and devise a unique business strategy for each one of them. But what you can do is cluster all of your customers into, say, 10 groups based on their purchasing habits, and use a separate strategy for the customers in each of these 10 groups. This is called clustering.


The following are typical applications of clustering:

  • Marketing: Finding groups of customers with similar behaviour given a large database of customer data containing their properties and past buying records.
  • Biology: Classification of plants and animals given their features.
  • Libraries: Book ordering.
  • Insurance: Identifying groups of motor-insurance policy holders with a high average claim cost; identifying frauds.
  • City planning: Identifying groups of houses according to their type, value, and geographical location.
  • Earthquake studies: Clustering observed earthquake epicentres to identify dangerous zones.
  • WWW: Document classification; clustering weblog data to discover groups of similar access patterns.

Now let’s discuss two of the most popular clustering algorithms in detail: K-Means clustering and hierarchical clustering. Let’s begin.

Types of clustering algorithms:

Several algorithms are available to solve clustering problems, and every methodology follows a different set of rules for defining the ‘similarity’ among data points. In fact, more than 100 clustering algorithms are known, but only a few are very popular. Let’s look at them in detail:

Centroid models: These are iterative clustering algorithms in which clusters are formed by the closeness of data points to the centroids of the clusters. The cluster centre, i.e. the centroid, is chosen so that the distance from the data points assigned to it is minimal. The K-Means algorithm is a popular algorithm in this category. Its biggest drawback is that we need to specify K in advance.

Distribution models: These clustering models are based on the notion of how probable it is that all data points in a cluster belong to the same distribution (for example, a Gaussian/normal distribution). As the distance from the distribution’s centre increases, the probability that a point belongs to that distribution decreases. These models often suffer from overfitting, and when you do not know the type of distribution in your data, you should use a different algorithm.
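
As a quick illustration of a distribution model, the sketch below fits a Gaussian mixture with scikit-learn on synthetic blob data; the number of components and the synthetic data are arbitrary choices for the example.

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic 2-D data standing in for real features.
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Fit a mixture of three Gaussians and inspect soft memberships.
gmm = GaussianMixture(n_components=3, random_state=42).fit(X)
labels = gmm.predict(X)        # hard cluster assignments
probs = gmm.predict_proba(X)   # per-cluster membership probabilities
print(labels[:5], probs[:5].round(3))
```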

Density models: Density-based clustering connects areas of high example density into clusters. These algorithms have difficulty with data of varying densities and high dimensionality, and, by design, they do not assign outliers to clusters.
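
A minimal density-model sketch using DBSCAN from scikit-learn; the eps and min_samples values here are illustrative and would need tuning for real data.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaved half-moons: a shape centroid models handle poorly.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)
# DBSCAN labels outliers as -1 rather than forcing them into a cluster.
print(set(db.labels_))
```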

K Means Clustering:

K-Means clustering is an effective, widely used, general-purpose clustering algorithm. Before actually running it, we have to define a distance function between data points (for example, Euclidean distance if we want to cluster points in space), and we have to set the number of clusters we want (k).

The algorithm begins by selecting k points as starting centroids (‘centers’ of clusters). We can just select any k random points, or we can use some other approach, but picking random points is a good start.

The algorithm then works in the following steps:

  1. Specify the desired number of clusters K. For example, choose k = 2 for a small set of 5 data points in 2-D space.
  2. Randomly assign each data point to a cluster, e.g. three points to cluster 1 and two points to cluster 2.
  3. Compute the cluster centroids, i.e. the mean position of the points currently assigned to each cluster.
  4. Re-assign each point to the closest cluster centroid. Any point that now lies closer to the other cluster’s centroid switches to that cluster.
  5. Re-compute the cluster centroids for both clusters.
  6. Repeat steps 4 and 5 until no improvement is possible, i.e. until no data point switches clusters between two successive iterations. This convergence (to a local optimum) marks the termination of the algorithm, unless a maximum number of iterations is specified explicitly.
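
A minimal K-Means sketch with scikit-learn, assuming a standardized feature matrix; the random placeholder data and k = 4 are illustrative only. Scaling matters because K-Means is distance-based.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))  # placeholder for the derived KPI matrix

# Standardize so no single feature dominates the Euclidean distance.
X_scaled = StandardScaler().fit_transform(X)

# k-means++ initialization is scikit-learn's default and usually a good start.
km = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X_scaled)
print(km.labels_[:10])   # cluster assignment per row
print(km.inertia_)       # within-cluster sum of squared distances
```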

Hierarchical Clustering

Hierarchical clustering creates a tree of clusters and, not surprisingly, is well suited to hierarchical data, such as taxonomies. The algorithm starts with every data point assigned to a cluster of its own. Then, at each step, the two nearest clusters are merged into a single cluster. The algorithm terminates when only one cluster is left.

There are two important things that we should know about hierarchical clustering:

  • The algorithm described above follows the bottom-up (agglomerative) approach. It is also possible to follow a top-down (divisive) approach, starting with all data points assigned to the same cluster and recursively performing splits until each data point is a separate cluster.
  • The decision to merge two clusters is taken on the basis of the closeness of these clusters. There are multiple metrics for deciding the closeness of two clusters:
    • Euclidean distance: $\|a-b\|_2 = \sqrt{\sum_i (a_i-b_i)^2}$
    • Squared Euclidean distance: $\|a-b\|_2^2 = \sum_i (a_i-b_i)^2$
    • Manhattan distance: $\|a-b\|_1 = \sum_i |a_i-b_i|$
    • Maximum distance: $\|a-b\|_\infty = \max_i |a_i-b_i|$
    • Mahalanobis distance: $\sqrt{(a-b)^\top S^{-1} (a-b)}$, where $S$ is the covariance matrix
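
A short agglomerative (bottom-up) sketch with SciPy, assuming a placeholder feature matrix; Ward linkage and the three-cluster cut are illustrative choices.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(42)
X = rng.normal(size=(50, 5))  # placeholder feature matrix

# Build the merge tree; 'ward' minimizes within-cluster variance
# and implies Euclidean distance between observations.
Z = linkage(X, method="ward")

# Cut the tree into at most 3 flat clusters.
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)
```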

Difference between K Means and Hierarchical clustering

  • Hierarchical clustering can’t handle big data well, but K-Means can: the time complexity of K-Means is linear, i.e. O(n), while that of hierarchical clustering is quadratic, i.e. O(n²).
  • In K-Means clustering, since we start with a random choice of initial clusters, the results produced by running the algorithm multiple times might differ. In hierarchical clustering, the results are reproducible.
  • K-Means is found to work well when the shape of the clusters is hyperspherical (like a circle in 2-D or a sphere in 3-D).
  • K-Means clustering requires prior knowledge of K, i.e. the number of clusters you want to divide your data into. In hierarchical clustering, by contrast, you can stop at whatever number of clusters you find appropriate by interpreting the dendrogram. (A sketch of the elbow method, a common heuristic for choosing K, follows this list.)
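
Since K-Means needs K up front, a common heuristic is the elbow method: run K-Means for a range of K values and look for the point where the inertia curve flattens. A hedged sketch on placeholder data:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))  # placeholder feature matrix

# Inertia always decreases with K; the "elbow" is where the gains flatten.
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    print(k, round(km.inertia_, 1))
```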

References

Analytics Vidhya

Google Developers