Machine Learning Techniques

We provide a brief description of the three machine learning techniques used in our study as follows:

Decision tree is a non-parametric supervised learning algorithm that predicts the target variable by learning simple decision rules inferred from the data features.
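
For illustration, a minimal sketch using scikit-learn's DecisionTreeClassifier; the toy data and the max_depth value are assumptions of this sketch, not part of our experimental setup:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Toy binary classification data (illustrative only).
X, y = make_classification(n_samples=200, random_state=0)

# Fit a tree; max_depth (an assumed value) limits the depth of the learned rules.
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(clf.predict(X[:5]))
```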

Random forest is an ensemble learning method for classification that operates by constructing a multitude of decision trees at training time and outputting the class selected by the majority of the trees.
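
A corresponding sketch with scikit-learn's RandomForestClassifier; the data and the n_estimators value are again illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, random_state=0)

# An ensemble of 100 trees, each fit on a bootstrap sample of the training data;
# predictions are aggregated by majority vote across the trees.
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(clf.predict(X[:5]))
```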

KNN (k-nearest neighbors) is a non-parametric supervised learning method that classifies a sample by majority vote among its k closest training examples in the data set.
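
A minimal sketch with scikit-learn's KNeighborsClassifier; the choice k = 5 is an assumption of this sketch:

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, random_state=0)

# Each query point is assigned the majority class among its 5 nearest neighbors.
clf = KNeighborsClassifier(n_neighbors=5).fit(X, y)
print(clf.predict(X[:5]))
```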

Balancing Techniques

Below, we briefly describe the seven balancing techniques used in our study:

Under-sampling balances the dataset by randomly reducing the size of the majority class until it matches the size of the minority class.
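
A minimal sketch using imbalanced-learn's RandomUnderSampler; the 90/10 class ratio of the toy data is an assumption:

```python
from collections import Counter
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification

# Imbalanced toy data: roughly 90% majority / 10% minority class.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# Randomly discard majority samples until both classes have equal size.
X_res, y_res = RandomUnderSampler(random_state=0).fit_resample(X, y)
print(Counter(y), "->", Counter(y_res))
```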

Over-sampling balances the dataset by randomly duplicating samples of the minority class until it matches the size of the majority class.
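
The mirror-image sketch with imbalanced-learn's RandomOverSampler, under the same assumed toy data:

```python
from collections import Counter
from imblearn.over_sampling import RandomOverSampler
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# Randomly duplicate minority samples until both classes have equal size.
X_res, y_res = RandomOverSampler(random_state=0).fit_resample(X, y)
print(Counter(y), "->", Counter(y_res))
```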

Both-sampling is a mix of under- and over-sampling: it randomly reduces the majority class and increases the minority class until each class contains approximately half of the original total, i.e., (majority + minority) / 2 samples.
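
imbalanced-learn offers no single "both-sampling" transformer, so the sketch below chains its random under- and over-samplers toward the midpoint target described above; the data and ratios are assumptions:

```python
from collections import Counter
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
counts = Counter(y)
maj, mnr = max(counts, key=counts.get), min(counts, key=counts.get)
target = (counts[maj] + counts[mnr]) // 2  # midpoint of the two class sizes

# Shrink the majority class to the target, then grow the minority class to match.
X_mid, y_mid = RandomUnderSampler(sampling_strategy={maj: target},
                                  random_state=0).fit_resample(X, y)
X_res, y_res = RandomOverSampler(sampling_strategy={mnr: target},
                                 random_state=0).fit_resample(X_mid, y_mid)
print(Counter(y_res))
```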

SMOTE (Synthetic Minority Oversampling Technique) grows the minority class like over-sampling does, but instead of duplicating samples it synthesizes new elements in the vicinity of already existing minority elements [1].
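
A minimal sketch with imbalanced-learn's SMOTE; k_neighbors = 5 (the library default) and the toy data are assumptions:

```python
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# New minority points are interpolated between each minority sample and
# its k nearest minority neighbors.
X_res, y_res = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
print(Counter(y_res))
```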

Borderline-SMOTE is a variant of the original SMOTE algorithm in which only borderline samples, i.e., minority samples close to the boundary with the majority class, are detected and used to generate new synthetic samples [2].
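
The same pattern applies with imbalanced-learn's BorderlineSMOTE; choosing the "borderline-1" variant here is an assumption of this sketch:

```python
from collections import Counter
from imblearn.over_sampling import BorderlineSMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# Synthesizes new points only around minority samples flagged as borderline.
X_res, y_res = BorderlineSMOTE(kind="borderline-1",
                               random_state=0).fit_resample(X, y)
print(Counter(y_res))
```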

SVM-SMOTE is a variant of the SMOTE algorithm that uses the support vectors of an SVM to detect the borderline samples from which new synthetic samples are generated [3].
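
A minimal sketch with imbalanced-learn's SVMSMOTE, which fits an internal SVM to locate the borderline region; the toy data are an assumption:

```python
from collections import Counter
from imblearn.over_sampling import SVMSMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# Support vectors of an internal SVM approximate the class boundary; new
# minority samples are synthesized around them.
X_res, y_res = SVMSMOTE(random_state=0).fit_resample(X, y)
print(Counter(y_res))
```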

ADASYN (Adaptive Synthetic) is an algorithm that generates synthetic data. Its main advantages are that it does not duplicate existing minority data and that it generates more data for “harder to learn” examples. ADASYN is similar to SMOTE, but the number of samples it generates per minority point depends on a local estimate of the class distribution around that point.
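
A minimal sketch with imbalanced-learn's ADASYN; n_neighbors = 5 (the library default) and the toy data are assumptions:

```python
from collections import Counter
from imblearn.over_sampling import ADASYN
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# More synthetic points are generated for minority samples whose neighborhoods
# are dominated by the majority class ("harder to learn" examples).
X_res, y_res = ADASYN(n_neighbors=5, random_state=0).fit_resample(X, y)
print(Counter(y_res))
```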

References

[1] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “SMOTE: Synthetic Minority Over-sampling Technique”. Journal of Artificial Intelligence Research, vol. 16, pp. 321–357, 2002.

[2] H. Han, W.-Y. Wang, and B.-H. Mao, “Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning”. Advances in Intelligent Computing, pp. 878–887, 2005.

[3] H. M. Nguyen, E. W. Cooper, and K. Kamei, “Borderline Over-sampling for Imbalanced Data Classification”. International Journal of Knowledge Engineering and Soft Data Paradigms, vol. 3, no. 1, pp. 4–21, 2011.