Fig. 1 | BioData Mining

Fig. 1

From: Ten quick tips for machine learning in computational biology

a Example of a dataset feature that needs pre-processing and cleaning before being used in a machine learning program. All the feature values lie in the [0, 0.5] range, except for one outlier with value 80 (Tip 1). b Representation of a typical dataset table with N features as columns and M data instances as rows. An effective split of an input dataset table assigns 50% of the data instances to the training set, 30% to the validation set, and the remaining 20% to the test set (Tip 2). c Example of a typical imbalanced biological dataset, which can contain 90% negative data instances and only 10% positive ones. This imbalance can be tackled with under-sampling and other techniques (Tip 5).
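The three panels can be illustrated with a minimal Python sketch. It is not code from the article: the function names (`flag_outliers`, `split_dataset`, `undersample`) are hypothetical, the outlier rule is a median-based modified z-score with a cutoff of 3.5 chosen here as one common convention, labels are assumed to be binary (0 for negative, 1 for positive), and the under-sampling shown is plain random removal of majority-class instances.

```python
import random
from statistics import median


def flag_outliers(values, threshold=3.5):
    """Panel a: flag values whose modified z-score exceeds `threshold`.

    The median and the median absolute deviation (MAD) are used so that
    a single extreme value does not distort the scale estimate.
    The 3.5 cutoff is an assumption, not a value from the article.
    """
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return [False] * len(values)
    return [abs(0.6745 * (v - med) / mad) > threshold for v in values]


def split_dataset(rows, fractions=(0.5, 0.3, 0.2), seed=0):
    """Panel b: shuffle rows, then split into training/validation/test."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_train = round(fractions[0] * len(rows))
    n_val = round(fractions[1] * len(rows))
    train = rows[:n_train]
    val = rows[n_train:n_train + n_val]
    test = rows[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test


def undersample(rows, labels, seed=0):
    """Panel c: randomly drop majority-class rows until classes are equal.

    Assumes binary labels (0 = negative, 1 = positive).
    Returns a list of (row, label) pairs.
    """
    rng = random.Random(seed)
    pos = [(r, y) for r, y in zip(rows, labels) if y == 1]
    neg = [(r, y) for r, y in zip(rows, labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    kept = rng.sample(majority, len(minority))
    return minority + kept
```

With a feature such as `[0.1, 0.25, 0.4, 0.05, 0.5, 80.0, 0.3, 0.2, 0.45, 0.15]`, `flag_outliers` marks only the value 80. A table of 100 instances is split by `split_dataset` into 50, 30, and 20 instances, and a 90%/10% dataset of 100 instances is reduced by `undersample` to 10 negatives and 10 positives. Random under-sampling discards data, so it is only one of the possible techniques the caption alludes to.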
