Section One

Project Overview

This page is a web-based presentation of the results of a clustering and classification project. The project was built to fulfill the final assignment of the Dicoding course by meeting six criteria:

  • Criterion 1: Using the Provided Notebook Template
  • Criterion 2: Selecting Dataset Columns
  • Criterion 3: Achieving a Minimum Silhouette Score of 0.55 in Final Clustering Evaluation
  • Criterion 4: Writing a Detailed Cluster Interpretation
  • Criterion 5: Using Dataset and Labels from the Clustering Results
  • Criterion 6: Achieving a Minimum Accuracy and F1-Score of 87% on the Training and Testing Set

The clustering project started with loading the data, followed by Exploratory Data Analysis (including identification of missing values, duplicated data, and outliers), data preprocessing (including data cleaning, data encoding, and standardization & normalization), clustering model development (including feature selection with grid search, model selection, and model evaluation), and cluster visualization. The final evaluation indicated good performance, with a silhouette score of about 0.59. The data, labeled with its cluster, was then exported to a CSV file, which was later used for the classification project.
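As a rough illustration of the EDA steps named above, here is a minimal sketch; the file name, separator, and the choice of Income as the example column are assumptions, not details taken from the project:

```python
import pandas as pd

# Load the raw dataset (file name and tab separator are assumptions).
df = pd.read_csv("marketing_campaign.csv", sep="\t")

# Missing values and duplicated rows, as checked during EDA.
print(df.isna().sum())        # count of missing values per column
print(df.duplicated().sum())  # number of fully duplicated rows

# Simple IQR rule for flagging outliers in a numeric column.
q1, q3 = df["Income"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["Income"] < q1 - 1.5 * iqr) | (df["Income"] > q3 + 1.5 * iqr)]
print(len(outliers), "potential Income outliers")
```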

The classification project began with loading that file, followed by data splitting (70:30), classification model development (with Random Forest), hyperparameter tuning with grid search, and model evaluation. It turned out that the model's performance was poor, with low accuracy and F1 scores, so undersampling and oversampling were used to handle the problem. The final evaluation showed good performance, with accuracy and F1 scores above 90%.

As explained earlier, this page presents only the results and visualizations of the clustering and classification projects. For more details, you can access the notebooks through the following links: Machine Learning for Clustering, Machine Learning for Classification

Section Two

The Dataset

Customer Personality Analysis Dataset


Customer Personality Analysis is an in-depth analysis of the ideal customers for a company. It helps businesses better understand their customers and enables them to tailor products according to the specific needs, behaviors, and interests of different customer types.

Customer personality analysis allows businesses to adjust their products based on the target customers from various segments. For example, instead of spending money marketing a new product to all customers in the company's database, the company can analyze which customer segments are most likely to purchase the product and then market it only to that segment.

This dataset was chosen from the "Suggested Dataset Sources" in the Machine Learning Project Submission.

Number of rows: 2240 | Number of columns: 29 | Dataset Source: Dataset Link (Kaggle)

  • 1. ID: Customer’s unique identifier
  • 2. Year_Birth: Customer's birth year
  • 3. Education: Education Qualification of customer
  • 4. Marital_Status: Marital Status of customer
  • 5. Income: Customer's yearly household income
  • 6. Kidhome: Number of children in customer's household
  • 7. Teenhome: Number of teenagers in customer's household
  • 8. Dt_Customer: Date of customer's enrollment with the company
  • 9. Recency: Number of days since customer's last purchase
  • 10. Complain: 1 if the customer complained in the last 2 years, 0 otherwise
  • 11. MntWines: Amount spent on wine in last 2 years
  • 12. MntFruits: Amount spent on fruits in last 2 years
  • 13. MntMeatProducts: Amount spent on meat in last 2 years
  • 14. MntFishProducts: Amount spent on fish in last 2 years
  • 15. MntSweetProducts: Amount spent on sweets in last 2 years
  • 16. MntGoldProds: Amount spent on gold in last 2 years
  • 17. NumDealsPurchases: Number of purchases made with a discount
  • 18. AcceptedCmp1: 1 if customer accepted the offer in the 1st campaign, 0 otherwise
  • 19. AcceptedCmp2: 1 if customer accepted the offer in the 2nd campaign, 0 otherwise
  • 20. AcceptedCmp3: 1 if customer accepted the offer in the 3rd campaign, 0 otherwise
  • 21. AcceptedCmp4: 1 if customer accepted the offer in the 4th campaign, 0 otherwise
  • 22. AcceptedCmp5: 1 if customer accepted the offer in the 5th campaign, 0 otherwise
  • 23. Response: 1 if customer accepted the offer in the last campaign, 0 otherwise
  • 24. NumWebPurchases: Number of purchases made through the company’s website
  • 25. NumCatalogPurchases: Number of purchases made using a catalogue
  • 26. NumStorePurchases: Number of purchases made directly in stores
  • 27. NumWebVisitsMonth: Number of visits to company’s website in the last month
  • 28. Z_CostContact: Constant column (same value for every row)
  • 29. Z_Revenue: Constant column (same value for every row)

Section Three

Clustering

Modeling without Feature Selection


In this project, two models, DBSCAN and K-Means, were tested without prior feature selection. The evaluation using the silhouette score gave 0.590 for K-Means and 0.22 for DBSCAN.

Clustering 1
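For reference, a minimal sketch of how the two models might be fit and scored; the matrix name X (the preprocessed, encoded, and scaled features), the cluster count, and the DBSCAN parameters are all assumptions:

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.metrics import silhouette_score

# X: preprocessed feature matrix (assumed name); n_clusters assumed.
kmeans_labels = KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(X)
print("K-Means silhouette:", silhouette_score(X, kmeans_labels))

dbscan_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
# silhouette_score needs at least 2 labels; note DBSCAN marks noise as -1,
# which is counted here as its own group.
if len(set(dbscan_labels)) > 1:
    print("DBSCAN silhouette:", silhouette_score(X, dbscan_labels))
```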

The K-Means model shows good performance, but it uses too many features, which makes interpreting the clustering results more difficult. This means feature selection was needed.

Modeling with Feature Selection


To solve this issue, a correlation matrix was used to show which features have the most significant correlations with each other.

Clustering 2
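A heatmap like the one above can be produced with pandas and seaborn; seaborn is an assumption here, since the project only names matplotlib:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Correlation matrix over the numeric columns of the preprocessed DataFrame.
corr = df.select_dtypes("number").corr()

plt.figure(figsize=(14, 10))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Feature Correlation Matrix")
plt.tight_layout()
plt.show()
```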

Based on the correlation matrix, 16 candidate features were considered, of which 5 or 6 would be selected for model training. Here are the 16 features:

  • Year_Birth
  • Education
  • Income
  • MntWines
  • MntFruits
  • MntMeatProducts
  • MntFishProducts
  • MntSweetProducts
  • MntGoldProds
  • NumWebPurchases
  • NumCatalogPurchases
  • NumStorePurchases
  • Kidhome
  • NumWebVisitsMonth
  • Marital_Status
  • Response

Since we are selecting 5 or 6 features out of 16, finding the best subset manually would take far too long.

Mathematically, selecting the globally optimal 5 of 16 features requires C(16,5) = 16! / (5! · 11!) = 4368 trials, and selecting the best 6 requires C(16,6) = 16! / (6! · 10!) = 8008 trials. A grid search was therefore used to find the best 5 or 6 features, since 8008 trials is still a manageable number to search exhaustively. A sketch of such a search is shown below.
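This sketch enumerates every 5- and 6-feature subset with itertools and keeps the combination with the best silhouette score. The DataFrame name df_scaled and the range of cluster counts tried are assumptions:

```python
from itertools import combinations

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

candidate_features = [
    "Year_Birth", "Education", "Income", "MntWines", "MntFruits",
    "MntMeatProducts", "MntFishProducts", "MntSweetProducts", "MntGoldProds",
    "NumWebPurchases", "NumCatalogPurchases", "NumStorePurchases",
    "Kidhome", "NumWebVisitsMonth", "Marital_Status", "Response",
]

best = (-1.0, None, None)  # (score, features, n_clusters)
for k in (5, 6):  # C(16,5) + C(16,6) = 12376 subsets in total
    for subset in combinations(candidate_features, k):
        X_sub = df_scaled[list(subset)]    # df_scaled: preprocessed DataFrame (assumed name)
        for n_clusters in range(2, 7):     # cluster counts to try (assumed range)
            labels = KMeans(n_clusters=n_clusters, random_state=42,
                            n_init=10).fit_predict(X_sub)
            score = silhouette_score(X_sub, labels)
            if score > best[0]:
                best = (score, subset, n_clusters)

print(f"Best silhouette {best[0]:.3f} with {best[2]} clusters on features {best[1]}")
```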

The grid search applied to K-Means clustering showed that the best 5 features were MntFruits, MntMeatProducts, MntFishProducts, MntSweetProducts, and Response, with a total of 3 clusters and a silhouette score of 0.593. This was better than the first K-Means model, where feature selection was not applied, because the second model had fewer features and a better silhouette score.

In contrast, the best 6 features, MntFruits, MntMeatProducts, MntFishProducts, MntSweetProducts, MntGoldProds, and Response, produced 5 clusters with a silhouette score of 0.522, which did not meet the minimum silhouette score requirement and was worse than the model with the best 5 features. This means the 6-feature model was ruled out.

DBSCAN was not used in this case due to the extensive time required for feature selection and hyperparameter tuning. Since DBSCAN and the K-Means model with the best 6 features were ruled out, the only remaining option was the K-Means model with the best 5 features.

Cluster Visualization


The K-Means model with the best 5 features was visualized with matplotlib. Here is the visualization:

Clustering 3
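As a rough sketch of how such a chart could be generated; the PCA projection and the names df_scaled and best_features are assumptions, and the actual notebook may plot feature pairs directly:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

best_features = ["MntFruits", "MntMeatProducts", "MntFishProducts",
                 "MntSweetProducts", "Response"]
X_best = df_scaled[best_features]  # df_scaled: preprocessed DataFrame (assumed name)

# Final model: 3 clusters, as found by the grid search.
labels = KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(X_best)

# Project the 5-dimensional feature space down to 2D for plotting.
coords = PCA(n_components=2).fit_transform(X_best)

plt.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="viridis", s=10)
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.title("K-Means Clusters (best 5 features)")
plt.colorbar(label="Cluster")
plt.show()
```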

Cluster Interpretation


Here is the conclusion of the cluster interpretation. To see the detailed process behind it, visit the Google Colab link: Machine Learning for Clustering

Features:

  • MntFruits: Amount spent on fruits in the last 2 years
  • MntMeatProducts: Amount spent on meat in the last 2 years
  • MntFishProducts: Amount spent on fish in the last 2 years
  • MntSweetProducts: Amount spent on sweets in the last 2 years
  • Response: 1 if the customer accepted the offer in the last campaign, 0 otherwise
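The per-cluster profiles below can be reproduced by averaging each feature within each cluster. A minimal sketch, assuming the labeled DataFrame is named df_labeled with a cluster column, and an assumed output file name:

```python
# df_labeled: the data with its assigned "cluster" column (names assumed).
features = ["MntFruits", "MntMeatProducts", "MntFishProducts",
            "MntSweetProducts", "Response"]
summary = df_labeled.groupby("cluster")[features].mean().round(2)
print(summary)  # per-cluster averages behind the low/moderate/high labels below

# Export the labeled data to CSV for the classification project, as described earlier.
df_labeled.to_csv("clustered_customers.csv", index=False)  # file name assumed
```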

Cluster 0:

  • Customers in Cluster 0 have moderate spending on fruits.
  • Customers in Cluster 0 have moderate spending on meat products.
  • Customers in Cluster 0 have low spending on fish products.
  • Customers in Cluster 0 have moderate spending on sweet products.
  • It can be interpreted that customers in Cluster 0 are more likely to have a Response of 0, i.e., they did not accept the offer in the last campaign.

Cluster 1:

  • Customers in Cluster 1 have low spending on fruits.
  • Customers in Cluster 1 have low spending on meat products.
  • Customers in Cluster 1 have moderate spending on fish products.
  • Customers in Cluster 1 have low spending on sweet products.

Cluster 2:

  • Customers in Cluster 2 have high spending on fruits.
  • Customers in Cluster 2 have high spending on meat products.
  • Customers in Cluster 2 have high spending on fish products.
  • Customers in Cluster 2 have high spending on sweet products.
  • It can be interpreted that customers in Cluster 2 are more likely to have a Response of 1, i.e., they accepted the offer in the last campaign.

Section Four

Classification

Modeling without Hyperparameter Tuning


The data labeled with cluster numbers was exported for use in the ML-Classification project. The model used in this project is the Random Forest classifier. The data was first split into a 70:30 ratio, and the 70% portion was used to train the model with default hyperparameters. The model's performance is as follows:

Classification 1
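A minimal sketch of this baseline step, assuming the exported CSV has been loaded into df_labeled and that the cluster label is the target column (both names are assumptions):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Features and target (column names assumed from the clustering output).
X = df_labeled[["MntFruits", "MntMeatProducts", "MntFishProducts",
                "MntSweetProducts", "Response"]]
y = df_labeled["cluster"]

# 70:30 split, as described above.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

rf = RandomForestClassifier(random_state=42)  # default hyperparameters
rf.fit(X_train, y_train)

for name, X_part, y_part in [("train", X_train, y_train), ("test", X_test, y_test)]:
    pred = rf.predict(X_part)
    print(name,
          "accuracy:", round(accuracy_score(y_part, pred), 3),
          "f1:", round(f1_score(y_part, pred, average="weighted"), 3))
```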

Modeling with Hyperparameter Tuning


The model appears to exhibit overfitting, as it performs well on the training data but significantly worse on the testing data. To address this issue, hyperparameter tuning was conducted, with grid search chosen as the tuning algorithm for this project. The following parameter grid was considered:

Classification 2
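For illustration, such a grid search might look like the following; the parameter values here are placeholders, not the actual grid shown in the figure above:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative grid only; the project's real grid is in the figure above.
param_grid = {
    "n_estimators": [100, 200, 500],
    "max_depth": [None, 5, 10],
    "min_samples_split": [2, 5, 10],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="f1_weighted",
    cv=5,
    n_jobs=-1,
)
search.fit(X_train, y_train)
best_rf = search.best_estimator_  # model refit on the full training set
print(search.best_params_)
```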

After completing hyperparameter tuning, the best model was selected and its performance evaluated. Below is a comparison of the model's performance before and after hyperparameter tuning:

Classification 3

Modeling after Oversampling and Undersampling


The model after hyperparameter tuning appears to exhibit underfitting, as indicated by poor performance on both the training and testing data. To address this issue, oversampling and undersampling methods were applied. Once these methods were implemented, the model was trained both with and without hyperparameter tuning. Below is a comparison of the model's performance before and after hyperparameter tuning, followed by a sketch of the resampling step.

Classification 4
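A sketch of the resampling step, assuming the imbalanced-learn library with simple random over/undersampling; the project does not name the exact method used:

```python
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from sklearn.ensemble import RandomForestClassifier

# Resample only the training set; the test set stays untouched.
X_over, y_over = RandomOverSampler(random_state=42).fit_resample(X_train, y_train)
X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X_train, y_train)

# Retrain on the rebalanced data (defaults shown; the tuned grid can be reused
# the same way for the "with hyperparameter tuning" comparison).
rf_over = RandomForestClassifier(random_state=42).fit(X_over, y_over)
rf_under = RandomForestClassifier(random_state=42).fit(X_under, y_under)
```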