Comparative analysis of support vector machine and k-nearest neighbors with a pyramidal histogram of the gradient for sign language detection

Abstract: Communication using sign language is very efficient, since its speed of information delivery is closer to verbal communication (speaking) than to writing or typing. Because of this, sign language is often used by deaf, speech-impaired, and hearing people to communicate. To make sign language translation easier, a system is needed to translate the symbols formed by hand movements (in the form of images) into text or sound. This study compares the performance, namely accuracy and computation time, of the Support Vector Machine (SVM) and K-Nearest Neighbors (KNN) classifiers, both using the Pyramidal Histogram of Gradient (PHOG) for feature extraction, to determine which is better at recognizing sign language. As a result, both combined methods, PHOG-SVM and PHOG-KNN, can recognize images of hand movements that form particular symbols. The system built with the SVM classifier achieves its highest accuracy of 82% at PHOG level 3, while the system built with the KNN classifier achieves its highest accuracy of 78% at PHOG level 2. The fastest total computation time for training and testing is 236.53 seconds for the SVM model at PHOG level 3 and 78.27 seconds for the KNN model at PHOG level 3. In terms of accuracy, PHOG-SVM is better, but in terms of computation time, PHOG-KNN is superior.


Introduction
Sign language is one of several communication methods that deaf and speech-impaired persons, as well as hearing people, can use to communicate. Communication is carried out through facial expressions and hand movements that form symbols representing a letter or word. This method is very efficient, since its speed of information delivery is closer to verbal communication (speaking) than to writing or typing. To make sign language translation easier, a system is needed to translate the symbols formed by hand movements (in the form of images) into text or sound.
Image translation to recognize a certain pattern or shape can be done using the Pyramidal Histogram of Gradient (PHOG) feature extraction method and the Support Vector Machine (SVM) classification method. Research on sign language recognition using these methods has been conducted in [1], with an accuracy of up to 86%. Meanwhile, [2] obtained an accuracy of up to 91.8% using a combination of the Haar Classifier and K-Nearest Neighbors (KNN) on a different dataset. Therefore, this study compares the performance, namely accuracy and computation time, of SVM and KNN with PHOG feature extraction on the same dataset as [1], to find out which is better at recognizing sign language.
Research on sign language recognition has also been conducted in [3] using Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN) methods, resulting in an average accuracy of 60.58%. In addition, [4] conducted research using the Generalized Learning Vector Quantization (GLVQ) method with an alpha value of 0.9, achieving a highest accuracy of 71.37%. Another study, [5], used the Hebb Rule to recognize static sign language and achieved an accuracy of 80.37%.
Research on letter pattern recognition outside sign language has also been carried out in [6] using Adaline, concluding that Adaline is effective for letter pattern recognition applications when the alpha value is reduced to 0.05, the tolerance value is increased to 0.1, and a bipolar activation function is used. Research in [7] used an artificial neural network (ANN), specifically a perceptron, to recognize the character patterns of the Karo script, achieving recognition rates of up to 100%.
The SVM and KNN classifiers were studied in [8] using Partial Least Squares (PLS) for dimensionality reduction and SVM and KNN for classification. That study produced the highest accuracies of 98.54% on leukemia data with PLS-KNN, 100% on lung data with KNN, 66.52% on breast data with PLS-KNN, and 85.60% on colon data with PLS-SVM.
Research on the recognition of tajwid rules has been carried out in [9] using KNN with Local Mean, achieving a highest accuracy of 96.43%.
Another study on image recognition was conducted in [10], focusing on license plate recognition using Optical Character Recognition (OCR). The success rate for recognizable license plates was 75%.

Pyramidal Histogram of Gradient (PHOG)
The PHOG method applies HOG at each pyramid level, according to the cell size and the number of orientation bins, and then concatenates the resulting histograms into a single feature vector, following the formulation in [11].

Dataset
The dataset used is an open-source dataset from [12]. It consists of a collection of near-infrared images and skeletal information obtained from the Leap Motion sensor. The dataset covers 16 types of hand movements (palm, L, fist-moved, down, index, ok, palm m, C, heavy, hang, two, three, four, five, palm u, and up) from 15 subjects (5 male and 10 female), for a total of 13,000 infrared hand-motion images. However, following [1], only six movement types are used: palm, L, down, index, ok, and C. Of the 13,000 images, [1] states that 3,800 are used, divided into 3,300 training images and 500 test images; however, the discussion in [1] uses only 486 test images, so a total of 3,786 images are used, selected by random sampling. The distribution of images for each movement type can be seen in Tables 1 and 2. Table 1 shows the six classes of training data: Palm, L, Down, Index, Ok, and C, with 550 images each, for a total of 3,300 training images. Table 2 shows the six classes of test data: Palm with 80 images, L with 80, Down with 86, Index with 70, Ok with 90, and C with 80, for a total of 486 test images.

System Design
This study compares two classification methods, SVM and KNN, to find out which is better at recognizing sign language in terms of accuracy and computation time. The feature extraction used for both methods is PHOG at levels 1, 2, and 3. The three main stages of this research are Pre-processing & Feature Extraction (PPFE), Training, and Testing. Figure 1 shows the PPFE stage for the training data (a) and the test data (b). The process is the same for both: pre-processing followed by PHOG feature extraction. The output of this stage is pre-processed, feature-extracted training and test data. The training data is used in the SVM and KNN Training stage, while the test data is used in the SVM and KNN Testing stage.

Pre-processing
The images from [12] vary in size from 408 x 264 to 420 x 273 pixels, so the image sizes must be equalized. All images are converted to 408 x 264 using cropping. Relative to the largest size, 420 x 273, only 9 to 12 pixels are removed in each dimension.
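The paper does not state how the crop is positioned within each image. As an illustration only, a center crop to the common 408 x 264 size could be sketched as follows (the `center_crop` helper is hypothetical, not from the paper):

```python
import numpy as np

def center_crop(img, target_h=264, target_w=408):
    """Center-crop an image array (rows x cols) to the target size.
    The paper only states that cropping is used; centering is an assumption."""
    h, w = img.shape[:2]
    top = (h - target_h) // 2
    left = (w - target_w) // 2
    return img[top:top + target_h, left:left + target_w]

# Example: reduce the largest 420 x 273 image to the common 408 x 264 size.
img = np.zeros((273, 420), dtype=np.uint8)
cropped = center_crop(img)
print(cropped.shape)  # (264, 408)
```

Any crop position would satisfy the size constraint; a center crop simply discards border pixels symmetrically.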

Feature Extraction
After image pre-processing, image features are extracted using the PHOG method at levels 1, 2, and 3, for both the training and test images. Each PHOG level applies HOG with 9 orientation bins, 8 x 8 pixels per cell, and 2 x 2 cells per block.
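The implementation library is not named in the paper. As a rough illustration of the pyramid idea only, the sketch below computes a simplified PHOG in plain NumPy: at level l the image is split into a 2^l x 2^l grid and a 9-bin orientation histogram is taken per region, with all histograms concatenated. It omits the 8 x 8 cell / 2 x 2 block HOG normalization the paper uses, and the `phog` and `orientation_histogram` helpers are hypothetical:

```python
import numpy as np

def orientation_histogram(region, n_bins=9):
    # Gradient magnitudes and unsigned orientations (0-180 degrees).
    gy, gx = np.gradient(region.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0
    hist, _ = np.histogram(ang, bins=n_bins, range=(0.0, 180.0), weights=mag)
    return hist

def phog(img, levels=3, n_bins=9):
    """Simplified PHOG: for each level l = 0..levels, split the image into
    a 2^l x 2^l grid, take an orientation histogram per region, and
    concatenate everything into one feature vector."""
    features = []
    for level in range(levels + 1):
        n = 2 ** level
        h, w = img.shape
        for i in range(n):
            for j in range(n):
                region = img[i * h // n:(i + 1) * h // n,
                             j * w // n:(j + 1) * w // n]
                features.append(orientation_histogram(region, n_bins))
    return np.concatenate(features)

img = np.random.default_rng(0).integers(0, 255, size=(264, 408))
vec = phog(img, levels=2)
# Levels 0..2 give (1 + 4 + 16) regions x 9 bins = 189 features.
print(vec.shape)  # (189,)
```

The feature length grows rapidly with the level, which is consistent with the computation-time differences across PHOG levels reported later.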

Training
Training (modeling) uses the SVM and KNN methods on the 3,300 training samples that have passed the pre-processing and feature extraction processes. SVM uses a polynomial kernel with degree 3 and a tolerance of 0.00001, while KNN uses 3 neighbors (k = 3); each configuration is applied at all PHOG levels. The result of this training is three SVM models and three KNN models, one for each PHOG level.
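The paper reports only the classifier configurations, not the software used. Assuming a scikit-learn implementation, the reported settings would map onto `SVC` and `KNeighborsClassifier` roughly as follows; the synthetic data here is a stand-in for the 3,300 PHOG feature vectors:

```python
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import make_classification

# Stand-in data with 6 classes, mimicking the six hand-movement classes.
X, y = make_classification(n_samples=300, n_features=50, n_classes=6,
                           n_informative=10, random_state=0)

# SVM configuration reported in the paper: polynomial kernel,
# degree 3, tolerance 0.00001.
svm = SVC(kernel="poly", degree=3, tol=1e-5).fit(X, y)

# KNN configuration reported in the paper: k = 3 neighbors.
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)

print(svm.score(X, y), knn.score(X, y))
```

Note that KNN has essentially no training phase beyond storing the data, which explains why its training times in Table 6 are far shorter than SVM's.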

Testing
Testing was carried out with the SVM and KNN models on the 486 test samples that had passed the pre-processing and feature extraction processes. Test data extracted with PHOG level 1 were tested with the level-1 SVM and KNN models, and so on for the other levels. The metric used for evaluating recognition performance is the confusion matrix (CM), presented in graphical form. From the CM values, four measurements are derived.
Precision (P) is the ratio of true positive predictions to all positive predictions: P = tp / (tp + fp). Recall (R) is the ratio of true positive predictions to all actually positive samples: R = tp / (tp + fn). F1-score (F) is the harmonic mean of precision and recall: F = 2PR / (P + R). Accuracy (A) is the ratio of correct predictions (positive and negative) to all data: A = (tp + tn) / (tp + tn + fp + fn). Here tp represents true positive samples, tn true negative samples, fp false-positive samples, and fn false-negative samples [13].
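As a worked illustration, the four measurements can be computed directly from a confusion matrix with rows as true classes and columns as predicted classes. The `per_class_metrics` helper and the toy two-class matrix below are illustrative, not from the paper:

```python
import numpy as np

def per_class_metrics(cm):
    """Per-class precision, recall, and F1 from a confusion matrix
    (rows = true class, columns = predicted class), plus overall accuracy."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)                 # correct predictions per class
    fp = cm.sum(axis=0) - tp         # predicted as the class but wrong
    fn = cm.sum(axis=1) - tp         # belong to the class but missed
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = tp.sum() / cm.sum()
    return precision, recall, f1, accuracy

# Toy 2-class example: 8 + 9 correct out of 20 samples.
cm = [[8, 2],
      [1, 9]]
p, r, f1, acc = per_class_metrics(cm)
print(acc)  # 0.85
```

For the six-class problem in this study, the same computation applies with a 6 x 6 matrix, giving one precision/recall/F1 triple per hand-movement class.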

Results and Analysis

Correct Prediction Graph
The results of the SVM confusion matrices for each PHOG level in [1], represented in graphical form, can be seen in Figure 4.

Computation Time
This study also compared the computation time required by the SVM and KNN methods for training on 3,300 images and testing on 486 images. Computation time is measured in seconds on a MacBook Pro (2019). The time required by each method at each PHOG level can be seen in Table 6. Since [1] did not report computation times, Table 6 shows only the times measured in this study.
Figure 4 and Figure 5 are expected to show similar results, considering that both use the same dataset source, pre-processing method, feature extraction, and classification. However, the results differ: the total correct predictions in Figure 4 are 366, 404, and 420, respectively, while in Figure 5 they are 394, 394, and 400. The increase across PHOG levels is large in Figure 4 but small in Figure 5.

Analysis
On the other hand, Figures 4 and 5 share a pattern: predictions for the L and Index classes are poor, followed by the Ok and C classes, while the best prediction results are for the Palm and Down classes.
Turning to the precision, recall, F1-score, and accuracy results, a clear difference appears in Table 3, which shows an overall increase in precision, recall, F1-score, and accuracy across PHOG levels, with accuracies of 75%, 83%, and 86%, respectively; Table 4 shows hardly any increase, at 81%, 81%, and 82%.
The differences in the confusion matrix, precision, recall, F1-score, and accuracy results are most likely due to two factors. The first is the selection of the image dataset: of the 13,000 images, only 3,786 were selected, so the images used in [1] may not be the same as those used in this research. The second is that [1] does not explicitly state the SVM configuration used; in this study, the authors tried several configurations and obtained the best results with a polynomial kernel, degree 3, and a tolerance of 0.00001.
This difference can still be tolerated, considering that the results of the two SVMs do not diverge too far and that this research used the same dataset source, pre-processing method, feature extraction, and classification as the previous study. For the SVM and KNN comparison to be valid, comparisons are made only between the SVM and KNN models built in this study.
Returning to the confusion matrix results, the SVM model in Figure 5 performs better than the KNN model in Figure 6: the total correct predictions of PHOG-1-SVM, PHOG-2-SVM, and PHOG-3-SVM are 394, 394, and 400, respectively, while those of PHOG-1-KNN, PHOG-2-KNN, and PHOG-3-KNN are 365, 381, and 370. The SVM model shows a slight increase in correct predictions, whereas the KNN model increases from PHOG-1-KNN to PHOG-2-KNN and then drops at PHOG-3-KNN.
There are also similarities across PHOG-1-SVM, PHOG-2-SVM, PHOG-3-SVM, PHOG-1-KNN, PHOG-2-KNN, and PHOG-3-KNN: predictions for the L and Index classes are poor, followed by Ok and C, while the best predictions are for the Palm and Down classes. Reviewing the L and Index images, there are indeed visible similarities between several images, as shown in Figures 7 and 8. The precision, recall, F1-score, and accuracy results in Table 4 show that PHOG-1-SVM, PHOG-2-SVM, and PHOG-3-SVM improve only slightly, and some classes, such as L, Index, Ok, and C, have low precision, recall, and F1-scores. A low F1-score, caused by low precision and recall, indicates that the model has difficulty distinguishing that class from the others. The accuracy obtained by PHOG-1-SVM and PHOG-2-SVM is 81%, and by PHOG-3-SVM 82%. Meanwhile, based on Table 5, the PHOG-1-KNN, PHOG-2-KNN, and PHOG-3-KNN models have lower precision, recall, F1-score, and accuracy than the SVM models. The accuracy of PHOG-1-KNN is 75%, PHOG-2-KNN increases to 78%, and PHOG-3-KNN decreases to 76%. These results indicate that at every PHOG level the SVM model is better than the KNN model.
Viewed from the aspect of computation time (Table 6), the KNN model is superior to the SVM model in the training process: training PHOG-1-KNN takes 3.41 seconds, while PHOG-1-SVM takes 353.26 seconds. The SVM model is superior to the KNN model in the testing process: PHOG-1-SVM takes 35.85 seconds, while PHOG-1-KNN takes 113.84 seconds. Overall, the KNN model is far superior in total computation time: PHOG-1-KNN totals 117.25 seconds, while PHOG-1-SVM takes 389.11 seconds. At PHOG level 3, the total computation time of the KNN model is 78.27 seconds, while that of the SVM model is 236.53 seconds. From another point of view, the SVM and KNN models share the same pattern: the higher the PHOG level, the shorter the computation time, so it can be concluded that the PHOG level affects the computation time of both models.
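The paper does not describe how the times were measured. One common approach is wall-clock timing around the fit and predict calls, roughly as sketched below; the `timed` helper and the stand-in workload are hypothetical:

```python
import time

def timed(fn, *args):
    """Return (result, elapsed wall-clock seconds) for a single call,
    e.g. timed(model.fit, X_train, y_train) or timed(model.predict, X_test)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

# Stand-in workload; in the study this would be the training or testing call.
result, elapsed = timed(sum, range(1_000_000))
print(f"{elapsed:.4f} s")
```

Summing the training and testing times measured this way would give the per-model totals reported in Table 6.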

Conclusion
Compared with previous research, the experiments in this study show differences in the confusion matrix, precision, recall, F1-score, and accuracy results. The first factor behind this difference is the selection of the image dataset: of the 13,000 images, only 3,786 were selected, so the images used in the previous study may not be the same as those used here. The second factor is that the previous research did not explicitly state the SVM configuration used; in this study, the authors tried several configurations and obtained the best results with a polynomial kernel, degree 3, and a tolerance of 0.00001.
Based on accuracy, the SVM method is superior to the KNN method at all PHOG levels. PHOG-1-SVM produces an accuracy of 81%, while PHOG-1-KNN reaches only 75%. PHOG-2-SVM produces 81% accuracy, while PHOG-2-KNN reaches 78%. PHOG-3-SVM increases to 82% accuracy, while PHOG-3-KNN decreases to 76%. Based on the total computation time required for training and testing, the KNN method is superior to the SVM method at all PHOG levels. The KNN method achieves the fastest total computation time of 78.27 seconds at PHOG level 3, while the SVM method takes 236.53 seconds at PHOG level 3.