Our system is built in such a way that answers for test dataset is not exposed to model author. Algorithm is presented with only train answers, and tested on test data set automatically.
Some examples of train/test splits:
We solve two types of tasks: classification and regression.
Classification tasks is used to predict binary phenotype, such as disease or some specific individual feature that could be expressed as binary(true/false of 1/0) value. For example:
Regression tasks is used to predict continuous scalar phenotype, such as height or weight. For example:
There are different metrics evaluated for these tasks on test set. For example:
Algorithm could use genetic and/or phenotype features to make predictions. In case of phenotype features temporal restrictions are applied.
For example if you take individual with some concrete binary phenotype(such as disease), then this phenotype could be diagnosed at the certain moment of time. After that moment a lot of consequences originated from that particular phenotype detail could arise and lead to algorithm overfit/shortcut learning. In order to minimize this effect one should obtain all features of this individual based on that certain moment or some earlier moment. This rule means that you need to build phenotype features according to the target phenotype that you're trying to predict.