1. Project introduction
This project focuses on classifying whether a customer will purchase a product based on their estimated salary and age.
Predicting in advance which customers are likely to buy a product is becoming very important for businesses. Thus, our goal is to classify purchasing customers based on their characteristics.
- Tools: R (caret / ElemStatLearn / e1071 / kernlab packages)
- Data: Social Network Ads data from Kaggle (https://www.kaggle.com/datasets/rakeshrau/social-network-ads)
- Analyses performed: Linear SVM
2. Data description & exploration
`Social Network Ads` is a dataset containing several customer characteristics along with whether each customer purchased a product. The dataset consists of 400 observations of 5 variables: `User.ID`, `Gender`, `Age`, `EstimatedSalary`, and `Purchased` (0: not purchased, 1: purchased). The goal is to classify whether each customer bought the product, so the response variable is `Purchased`. I will use only the `Age` and `EstimatedSalary` variables for classification.
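The loading and subsetting steps can be sketched as below (a minimal sketch; the file name `Social_Network_Ads.csv` matches the Kaggle download but is an assumption here):

```r
# Read the Kaggle CSV (file name is an assumption) and keep only the
# covariates and the response used in this analysis
ads <- read.csv("Social_Network_Ads.csv")
ads <- ads[, c("Age", "EstimatedSalary", "Purchased")]

# Treat the response as a class label so svm() performs classification
ads$Purchased <- as.factor(ads$Purchased)
head(ads)
```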
3. Pre-processing
- As we can see below, the two covariates have very different numerical ranges. Thus, we need to center and scale both covariates before proceeding with the analysis.
## Age EstimatedSalary Purchased
## 1 19 19000 0
## 2 35 20000 0
## 3 26 43000 0
## 4 27 57000 0
## 5 19 76000 0
## 6 27 58000 0
- After centering and scaling, the data looks like this:
## Age EstimatedSalary Purchased
## 1 -1.7795688 -1.4881825 0
## 2 -0.2532702 -1.4588544 0
## 3 -1.1118131 -0.7843075 0
## 4 -1.0164195 -0.3737137 0
## 5 -1.7795688 0.1835208 0
## 6 -1.0164195 -0.3443855 0
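The centering and scaling step can be done with base R's `scale()` (a sketch, assuming the data frame `ads` holds the three columns shown above):

```r
# scale() subtracts each column's mean and divides by its standard
# deviation, putting Age and EstimatedSalary on comparable scales
ads[, c("Age", "EstimatedSalary")] <- scale(ads[, c("Age", "EstimatedSalary")])
head(ads)
```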
4. Fit a linear SVM
- Before fitting a linear SVM, let's see what the data looks like. The orange dots represent customers who purchased a product, and the green dots represent those who did not.
We can expect that the data could be divided by a single line through the middle. Such a line could be placed anywhere between the green and orange dots, but we want the optimal line: the one that separates the two classes with the largest margin. Thus, I fit a linear SVM with
`cost = 1`. The confusion matrix and classification error rate are shown below. The classification error for this model is \(15.75\)%, so it is fairly reliable.
##
## 0 1
## 0 240 46
## 1 17 97
## [1] 0.1575
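A sketch of the fit and the error computation, using the `e1071` package (the data frame `ads` with a factor response `Purchased` is assumed from the earlier steps):

```r
library(e1071)

# Linear SVM with cost C = 1; the covariates are already scaled,
# so svm()'s internal scaling is turned off
fit <- svm(Purchased ~ Age + EstimatedSalary, data = ads,
           kernel = "linear", cost = 1, scale = FALSE)

# Confusion matrix and training classification error rate
pred <- predict(fit, ads)
table(predicted = pred, actual = ads$Purchased)
mean(pred != ads$Purchased)
```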
- Now, I drew the decision line on the plot, as shown below. The decision line is \(f(x) = x^T \boldsymbol \beta + \beta_0 = 0\), and its coefficients can be extracted from the fitted `svm` object. The estimated \(\beta\) values are \(-1.6509\) for `Age` and \(-0.9101\) for `EstimatedSalary`, and the estimated \(\beta_0\) is \(0.8792\). The decision line lies almost exactly where we expected.
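The coefficients can be recovered from the fitted object as follows (a sketch, assuming the `fit` object comes from `e1071::svm` with a linear kernel):

```r
# For e1071's svm with a linear kernel, the decision function is
# f(x) = x^T beta + beta_0, with beta = t(coefs) %*% SV and beta_0 = -rho
beta  <- t(fit$coefs) %*% fit$SV
beta0 <- -fit$rho

# Draw f(x) = 0 on a scatter plot of the scaled covariates:
# EstimatedSalary = -(beta0 + beta_Age * Age) / beta_EstimatedSalary
plot(ads$Age, ads$EstimatedSalary,
     col = ifelse(ads$Purchased == 1, "orange", "green"),
     xlab = "Age (scaled)", ylab = "EstimatedSalary (scaled)")
abline(a = -beta0 / beta[2], b = -beta[1] / beta[2])
```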
- Next, I marked the support vectors on the plot. Support vectors are the data points that lie on or inside the margin between the two classes. Since these points "support" the decision boundary, they are called support vectors.
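Marking the support vectors can be sketched with the `index` field of the fitted `e1071::svm` object, which holds the row indices of the support vectors (this assumes a scatter plot of the scaled data is already drawn):

```r
# fit$index gives the rows of the training data that are support vectors
sv <- ads[fit$index, ]
points(sv$Age, sv$EstimatedSalary, pch = 4, cex = 1.5)  # mark with crosses
```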
- As we can see, it is impossible to classify all points perfectly. Thus, we need to allow some misclassification by adjusting the cost parameter C. I used C = \(1\) before, so I also tried C = \(100\) and C = \(0.01\), shown below (in that order):
- From the results, the margin becomes narrower as C increases, and the classification performance changes accordingly (note that the values below are error rates, not accuracy):

| | Error rate |
|---|---|
| Cost = 0.01 | 0.1925 |
| Cost = 1 | 0.1575 |
| Cost = 100 | 0.1550 |
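The comparison across cost values can be reproduced with a short loop (a sketch under the same assumptions as the earlier code):

```r
library(e1071)

costs <- c(0.01, 1, 100)
errors <- sapply(costs, function(C) {
  m <- svm(Purchased ~ Age + EstimatedSalary, data = ads,
           kernel = "linear", cost = C, scale = FALSE)
  mean(predict(m, ads) != ads$Purchased)  # training error rate
})
data.frame(Cost = costs, Error = errors)
```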
5. Conclusion
From the analysis, the linear SVM classified customers fairly well based on their characteristics (`Age` and `EstimatedSalary`) when the cost C = \(1\). However, the classification results change as C is adjusted, so we need to find a suitable C for each situation. Since we performed the classification with only two variables, this SVM could improve its classification ability if we used more customer characteristics, such as `Gender`.