Machine Learning Approach to Identifying the Dataset Threshold for the Performance Estimators in Supervised Learning



Institute for Educational Development, East Africa


Currently, for small-scale machine learning projects, researchers have not set a threshold that inexperienced users, such as students, can use to categorise datasets when assessing and comparing the performance of machine learning algorithms. To address this gap, this paper presents a step-by-step guide for identifying the dataset threshold for performance estimators in supervised machine learning experiments. Identifying the threshold involves running experiments on four datasets of different sample sizes from the University of California Irvine (UCI) machine learning repository. The sample sizes are categorised according to the number of attributes and the number of instances in each dataset. The identified dataset threshold will help inexperienced machine learning experimenters categorise datasets correctly and hence select the appropriate performance estimation method.


This work was published before the author joined Aga Khan University.


International Journal for Infonomics (IJI)