1. Project introduction

This project classifies whether a customer will buy a product, using the customer's age and estimated salary.

Predicting in advance which customers will buy a product is becoming very important for business operators. Thus, our goal is to classify customers as purchasers or non-purchasers based on their characteristics.


2. Data description & exploration

  • Social Network Ads is a dataset containing several characteristics of each customer and whether he or she purchased a product. The dataset consists of 400 observations of 5 variables: User.ID, Gender, Age, EstimatedSalary, and Purchased (0: not purchased, 1: purchased).

  • The goal is to classify whether each customer bought the product, so the response variable is Purchased. I will use only the Age and EstimatedSalary variables for classification.


3. Pre-processing

  • As we can see below, the two covariates have very different numerical ranges. Thus, we need to center and scale both covariates before we proceed with the analysis.
##   Age EstimatedSalary Purchased
## 1  19           19000         0
## 2  35           20000         0
## 3  26           43000         0
## 4  27           57000         0
## 5  19           76000         0
## 6  27           58000         0
  • After centering and scaling, the data look as follows:
##          Age EstimatedSalary Purchased
## 1 -1.7795688      -1.4881825         0
## 2 -0.2532702      -1.4588544         0
## 3 -1.1118131      -0.7843075         0
## 4 -1.0164195      -0.3737137         0
## 5 -1.7795688       0.1835208         0
## 6 -1.0164195      -0.3443855         0
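
The centering and scaling step can be sketched as follows. This is a minimal illustration in Python, not the code that produced the output above, and `standardize` is a hypothetical helper name. It subtracts the mean and divides by the sample standard deviation. Because it uses only the six rows printed above rather than all 400 observations, its output will not match the scaled values shown in the report.

```python
from statistics import mean, stdev

def standardize(values):
    """Center to mean 0 and scale to sample standard deviation 1."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1 denominator)
    return [(v - m) / s for v in values]

# The six rows printed above (illustration only; the report scales all 400 rows).
ages = [19, 35, 26, 27, 19, 27]
salaries = [19000, 20000, 43000, 57000, 76000, 58000]

scaled_ages = standardize(ages)
scaled_salaries = standardize(salaries)

for a, s in zip(scaled_ages, scaled_salaries):
    print(f"{a:9.4f} {s:9.4f}")
```

After this transformation both covariates are on the same scale, so neither dominates the distance calculations that the SVM relies on.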

4. Fit a linear SVM

  • Before fitting a linear SVM, let’s see what the data look like. The orange dots show customers who purchased a product, and the green dots show those who did not.

  • We can expect that the data can be roughly separated by a single line through the middle of the plot. Many lines could be placed between the green and orange dots, but we want the optimal one, which separates the two groups with the largest margin. Thus, I fit a linear SVM with cost = 1.

  • The confusion matrix and classification error rate are shown below. The classification error for this model is \(15.75\)%, i.e. 63 of the 400 observations are misclassified, so the model classifies about 84% of the customers correctly.

##    
##       0   1
##   0 240  46
##   1  17  97
## [1] 0.1575
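
As a check on the arithmetic, the error rate can be recomputed from the confusion matrix counts. This is a small Python sketch; the variable names are illustrative and not taken from the original analysis.

```python
# Confusion matrix counts copied from the output above, stored row by row.
confusion = [
    [240, 46],
    [17, 97],
]

total = sum(sum(row) for row in confusion)
# Correct classifications sit on the diagonal.
correct = sum(confusion[i][i] for i in range(len(confusion)))
error_rate = (total - correct) / total

print(total, correct, error_rate)
```

The off-diagonal counts (46 and 17) are the misclassified observations, and together they account for the 0.1575 reported above.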
  • Next, I drew the decision line on the plot, as shown below. The decision line is \(f(x) = x^T \boldsymbol{\beta} + \beta_0 = 0\), and its coefficients can be calculated from the fitted SVM. The estimated \(\boldsymbol{\beta}\) is \(-1.6509\) for Age and \(-0.9101\) for EstimatedSalary, and the estimated \(\beta_0\) is \(0.8792\). The decision line was drawn almost exactly where we expected.
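
To make the decision line concrete, the sketch below evaluates \(f(x)\) for the six scaled rows printed in the pre-processing section, using the coefficients reported above. It is written in Python for illustration, and `decision_value` is a hypothetical helper name. Which sign corresponds to which class depends on how the fitting software orders the class labels, so the code only reports the values.

```python
# Coefficients reported above, applied to the scaled covariates.
BETA_AGE = -1.6509
BETA_SALARY = -0.9101
BETA_0 = 0.8792

def decision_value(age, salary):
    """f(x) = x^T beta + beta_0 for one scaled observation."""
    return age * BETA_AGE + salary * BETA_SALARY + BETA_0

# The six scaled rows from the pre-processing output (all have Purchased = 0).
rows = [
    (-1.7795688, -1.4881825),
    (-0.2532702, -1.4588544),
    (-1.1118131, -0.7843075),
    (-1.0164195, -0.3737137),
    (-1.7795688, 0.1835208),
    (-1.0164195, -0.3443855),
]

values = [decision_value(age, salary) for age, salary in rows]
for (age, salary), f in zip(rows, values):
    print(f"age={age:8.4f} salary={salary:8.4f} f(x)={f:7.4f}")
```

All six of these non-purchasers give a positive \(f(x)\), so they fall on the same side of the line. Since both coefficients are negative, older customers with higher salaries move toward the other side.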

  • Next, I marked the support vectors on the plot. Support vectors are the data points that lie on or inside the margin around the decision boundary, including any misclassified points. Since these points determine the decision boundary, they are called support vectors.

  • As we can see, it is impossible to classify all points perfectly with a single line. Thus, we need to tolerate some error, and the cost parameter C controls how much. I used C = \(1\) before, so I tried changing C to \(100\) and \(0.01\), as shown below (in that order):

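  • From the results, the margin around the decision line gets narrower as C increases, because a larger C penalizes margin violations more heavily. The classification error also changes with C, falling from 19.25% at C = 0.01 to 15.5% at C = 100, as below:
              Classification error
Cost = 0.01   0.1925
Cost = 1      0.1575
Cost = 100    0.155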
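
The role of C can be made explicit with the standard soft-margin formulation of the linear SVM, written here in the notation of the decision line above. The labels \(y_i \in \{-1, +1\}\) and the slack variables \(\xi_i\) are notation introduced for this sketch, and the exact parameterization used by the fitting software may differ.

```latex
\[
\min_{\boldsymbol{\beta},\, \beta_0,\, \xi}\;
\frac{1}{2}\lVert \boldsymbol{\beta} \rVert^2 + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad
y_i \left( x_i^T \boldsymbol{\beta} + \beta_0 \right) \ge 1 - \xi_i,
\qquad \xi_i \ge 0, \quad i = 1, \dots, n.
\]
```

Each \(\xi_i > 0\) is a point that falls inside the margin or on the wrong side of the line. A larger C makes such violations more expensive, so the margin, whose width is \(2 / \lVert \boldsymbol{\beta} \rVert\), tends to shrink. A smaller C accepts more violations in exchange for a wider margin.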

5. Conclusion

  • From the analysis, the linear SVM classified customers fairly well based on their characteristics (Age and EstimatedSalary) when C (cost) = \(1\), with a classification error of \(15.75\)%. However, the classification results change when we adjust C. Thus, we need to find an appropriate C for each situation.

  • Since we used only two variables for classification, the SVM's classification ability might improve if we used more characteristics of the customers, such as Gender.