1. Project introduction
This project focuses on classifying handwritten digits from digit images. I’ll perform this classification with the KNN and SVM methods.
Since interest in image recognition has recently been growing across almost every industry, I would like to perform a basic classification using the Handwritten Digit Data in the ElemStatLearn package.
- Tools: R (caret / ElemStatLearn packages)
- Data: the Handwritten Digit Data in the ElemStatLearn package (https://www.rdocumentation.org/packages/ElemStatLearn/versions/2015.6.26.2)
- Analyses performed: K-Nearest Neighbors (KNN), Linear/Radial Support Vector Machine (SVM)
2. Data description & exploration
Data description
- This dataset consists of two parts: train and test data. In both datasets, the first column is the actual digit label, and the remaining columns are the pixel values of the digit image.
Data exploration
- In this dataset, there are 7,291 observations with 257 variables in the train data and 2,007 observations with 257 variables in the test data. We can confirm the dimensions below (train first, then test):
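A minimal sketch of how the data could be loaded and the dimensions checked (the object names `train_all` / `test_all` are my own; the report uses the `zip.train` / `zip.test` data from the ElemStatLearn package):

```r
# Handwritten digit data: column 1 is the digit label, columns 2-257 are pixel values
library(ElemStatLearn)

train_all <- as.data.frame(zip.train)
test_all  <- as.data.frame(zip.test)

dim(train_all)  # 7291 x 257
dim(test_all)   # 2007 x 257
```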
## [1] 7291 257
## [1] 2007 257
- For this project, I’ll perform the classification of only two digits, \(2\) and \(7\), so we can consider this a two-class classification problem. After subsetting, our train data has \(1376\) observations and the test data has \(345\) observations:
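A sketch of the subsetting step, keeping only the digits 2 and 7 (the column renaming and the object names `train27` / `test27` are assumptions carried through the later sketches):

```r
# Keep only digits 2 and 7 for the two-class problem
train27 <- train_all[train_all[, 1] %in% c(2, 7), ]
test27  <- test_all[test_all[, 1] %in% c(2, 7), ]

# Rename the label column and treat it as a factor for classification
colnames(train27)[1] <- "digit"
colnames(test27)[1]  <- "digit"
train27$digit <- factor(train27$digit)
test27$digit  <- factor(test27$digit)

dim(train27)  # 1376 x 257
dim(test27)   # 345 x 257
```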
## [1] 1376 257
## [1] 345 257
3. Modeling
Fitting K-Nearest Neighbors (KNN)
- To fit the K-Nearest Neighbors (KNN) model, we need to define K, the number of neighbors (observations) considered. I performed 5-fold cross-validation with 3 repeats to find the best K in the range \(1\) ~ \(10\).
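A sketch of a caret call that would produce output like the block below (the seed is hypothetical; the grid `k = 1:10` matches the values reported):

```r
library(caret)

set.seed(1)  # hypothetical seed
knn_fit <- train(digit ~ ., data = train27,
                 method    = "knn",
                 trControl = trainControl(method = "repeatedcv", number = 5, repeats = 3),
                 tuneGrid  = data.frame(k = 1:10))
knn_fit
```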
## k-Nearest Neighbors
##
## 1376 samples
## 256 predictor
## 2 classes: '2', '7'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times)
## Summary of sample sizes: 1101, 1101, 1100, 1101, 1101, 1100, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 1 0.9866728 0.9732743
## 2 0.9840105 0.9679423
## 3 0.9849794 0.9698754
## 4 0.9842521 0.9684142
## 5 0.9844963 0.9689056
## 6 0.9825586 0.9650275
## 7 0.9830435 0.9659957
## 8 0.9811058 0.9621211
## 9 0.9818305 0.9635700
## 10 0.9818314 0.9635756
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 1.
From the results, the best K seems to be \(1\) or \(3\) in terms of cross-validation (CV) classification error. Although the CV error was lowest at \(k=1\), that choice might cause an overfitting issue.
Thus, I will compare the predictions for \((k=1)\) and \((k=3)\) to decide the best tuning K. We can also see this in the graph of classification error vs. K below:
K-Nearest Neighbor (KNN) - Accuracy
- I checked the confusion matrix and the statistics, including accuracy, for each model. We can see the results of the model with \((k=1)\) below:
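A sketch of how a KNN model with a fixed k can be refit and evaluated on the test set (shown for k = 1; object names are assumptions):

```r
# Refit KNN with a fixed k (no resampling) and evaluate on the test data
knn_1 <- train(digit ~ ., data = train27,
               method    = "knn",
               trControl = trainControl(method = "none"),
               tuneGrid  = data.frame(k = 1))

knn_1_pred <- predict(knn_1, newdata = test27)
confusionMatrix(knn_1_pred, test27$digit)
```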
## Confusion Matrix and Statistics
##
## Reference
## Prediction 2 7
## 2 193 1
## 7 5 146
##
## Accuracy : 0.9826
## 95% CI : (0.9625, 0.9936)
## No Information Rate : 0.5739
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9646
##
## Mcnemar's Test P-Value : 0.2207
##
## Sensitivity : 0.9747
## Specificity : 0.9932
## Pos Pred Value : 0.9948
## Neg Pred Value : 0.9669
## Prevalence : 0.5739
## Detection Rate : 0.5594
## Detection Prevalence : 0.5623
## Balanced Accuracy : 0.9840
##
## 'Positive' Class : 2
##
- The results of the model with \((k=3)\) are as below:
## Confusion Matrix and Statistics
##
## Reference
## Prediction 2 7
## 2 194 1
## 7 4 146
##
## Accuracy : 0.9855
## 95% CI : (0.9665, 0.9953)
## No Information Rate : 0.5739
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9704
##
## Mcnemar's Test P-Value : 0.3711
##
## Sensitivity : 0.9798
## Specificity : 0.9932
## Pos Pred Value : 0.9949
## Neg Pred Value : 0.9733
## Prevalence : 0.5739
## Detection Rate : 0.5623
## Detection Prevalence : 0.5652
## Balanced Accuracy : 0.9865
##
## 'Positive' Class : 2
##
From the results, the model with \((k=1)\) showed \(98.26\)% accuracy and the model with \((k=3)\) showed \(98.55\)% accuracy.
Although the \((k=1)\) model has a lower cross-validation error than the \((k=3)\) model, the \((k=3)\) model showed a lower test error. Therefore, I have chosen \(k=3\) as the best tuning parameter for my model.
Check point
We can also notice that the cross-validation error curve is neither U-shaped nor monotone. In particular, \(k = 3\) or \(4\) is better than \(k = 2\), but \(k = 1\) is the best.
This is because an even number of neighbors \(k\) can cause a tie problem when the model makes its decision. Accordingly, the models with even values of k \((2, 4, 6, ...)\) showed worse cross-validation error. This tie problem can hurt the accuracy of the model.
In addition, the variance of the predictions generally decreases as k increases, whereas the bias increases. In other words, as k increases, the model becomes more stable.
Fitting SVM - Linear
- To fit the linear Support Vector Machine (SVM), we need to find the best C (cost). I’ll use 10-fold cross-validation to search for C from 0.001 to 0.2.
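A sketch of the linear SVM fit; the `svmLinear2` method (e1071 backend, tuning parameter `cost`) and the 20-point grid are assumptions consistent with the output below:

```r
set.seed(1)  # hypothetical seed
svm_lin_fit <- train(digit ~ ., data = train27,
                     method    = "svmLinear2",
                     trControl = trainControl(method = "cv", number = 10),
                     tuneGrid  = data.frame(cost = seq(0.001, 0.2, length.out = 20)))
svm_lin_fit
```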
## Support Vector Machines with Linear Kernel
##
## 1376 samples
## 256 predictor
## 2 classes: '2', '7'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1238, 1238, 1237, 1239, 1239, 1239, ...
## Resampling results across tuning parameters:
##
## cost Accuracy Kappa
## 0.00100000 0.9890987 0.9781289
## 0.01147368 0.9920025 0.9839578
## 0.02194737 0.9934624 0.9868800
## 0.03242105 0.9927325 0.9854173
## 0.04289474 0.9912832 0.9825163
## 0.05336842 0.9927378 0.9854392
## 0.06384211 0.9927378 0.9854392
## 0.07431579 0.9927378 0.9854392
## 0.08478947 0.9927378 0.9854392
## 0.09526316 0.9934677 0.9869068
## 0.10573684 0.9934677 0.9869068
## 0.11621053 0.9927378 0.9854421
## 0.12668421 0.9927378 0.9854421
## 0.13715789 0.9927378 0.9854421
## 0.14763158 0.9927378 0.9854421
## 0.15810526 0.9927378 0.9854421
## 0.16857895 0.9927378 0.9854421
## 0.17905263 0.9927378 0.9854421
## 0.18952632 0.9927378 0.9854421
## 0.20000000 0.9927378 0.9854421
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cost = 0.09526316.
- The best C is \(0.0952\) based on the accuracy.
Linear SVM - Accuracy
- I have obtained the confusion table with the best C (\(0.0952\)) as below, and we can see that the accuracy of this SVM model is \(98.84\)%, which is quite reliable.
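A sketch of the test-set evaluation for the linear SVM (object names are assumptions):

```r
svm_lin_pred <- predict(svm_lin_fit, newdata = test27)
table(svm_lin_pred, test27$digit)    # confusion table
mean(svm_lin_pred == test27$digit)   # test-set accuracy
```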
##
## svm_lin_pred 2 7
## 2 196 2
## 7 2 145
## [1] 0.9884058
Fitting SVM - Radial
- To fit the radial SVM, we need to decide the best C and sigma first. I’ll perform 10-fold cross-validation and repeat it 3 times.
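A sketch of the radial SVM fit via caret’s `svmRadial` method (kernlab backend); the C / sigma grid mirrors the values in the output below:

```r
set.seed(1)  # hypothetical seed
svm_rad_fit <- train(digit ~ ., data = train27,
                     method    = "svmRadial",
                     trControl = trainControl(method = "repeatedcv", number = 10, repeats = 3),
                     tuneGrid  = expand.grid(C     = c(0.25, 0.50, 0.75),
                                             sigma = c(0.005, 0.010, 0.015)))
svm_rad_fit
```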
## Support Vector Machines with Radial Basis Function Kernel
##
## 1376 samples
## 256 predictor
## 2 classes: '2', '7'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 1238, 1238, 1237, 1239, 1239, 1239, ...
## Resampling results across tuning parameters:
##
## C sigma Accuracy Kappa
## 0.25 0.005 0.9903082 0.9805245
## 0.25 0.010 0.9907913 0.9814923
## 0.25 0.015 0.9903099 0.9805074
## 0.50 0.005 0.9907930 0.9815026
## 0.50 0.010 0.9915177 0.9829483
## 0.50 0.015 0.9910398 0.9819760
## 0.75 0.005 0.9912761 0.9824696
## 0.75 0.010 0.9910346 0.9819788
## 0.75 0.015 0.9907965 0.9814948
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.01 and C = 0.5.
- I have obtained the best C \((0.5)\) and sigma \((0.01)\) based on the accuracy.
Radial SVM - Accuracy
- I have obtained the confusion table below; the test-data accuracy was \(98.55\)% based on that C and sigma.
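A sketch of the test-set evaluation for the radial SVM (object names are assumptions):

```r
svm_rad_pred <- predict(svm_rad_fit, newdata = test27)
table(svm_rad_pred, test27$digit)    # confusion table
mean(svm_rad_pred == test27$digit)   # test-set accuracy
```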
## test.y
## svm_rad_pred 2 7
## 2 197 4
## 7 1 143
## [1] 0.9855072
A comparison between KNN / Linear SVM / Radial SVM methods
- The comparison of the three classification methods is as below:
| Model | Accuracy | Cost | Sigma |
|---|---|---|---|
| KNN | 0.9855 | - | - |
| Linear SVM | 0.9884 | 0.0952 | - |
| Radial SVM | 0.9855 | 0.5 | 0.01 |
- From the results, we can conclude that the Linear SVM model performed best in terms of accuracy. However, the results could change with another dataset or under different conditions.
4. Conclusion
From the analysis results, our Linear Support Vector Machine (SVM) showed the best accuracy, \(98.84\)%. However, the other models (KNN, Radial SVM) also showed quite reliable accuracy.
Since we performed these classification methods with only two digits, \(2\) and \(7\), from the dataset, the models’ classification ability could be extended by including more digits than just these two.