Methodology

Our system is built so that the answers for the test dataset are never exposed to the model author. An algorithm is given only the training answers and is evaluated on the test dataset automatically.
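
A minimal sketch of this evaluation flow is shown below. It is an illustration only, not the system's actual code: the names `evaluate_submission`, `fit_mean` and `mean_absolute_error` are hypothetical, and the toy metric was chosen only to keep the example short.

```python
def evaluate_submission(fit, train_features, train_answers,
                        test_features, hidden_test_answers, metric):
    """Train on the train set, then score on test answers the author never sees."""
    # The author's code receives train features and train answers only.
    predict = fit(train_features, train_answers)
    # Predictions are made from test features alone.
    predictions = [predict(x) for x in test_features]
    # Only the evaluator compares predictions with the hidden answers.
    return metric(hidden_test_answers, predictions)


def fit_mean(train_features, train_answers):
    """Hypothetical baseline: always predict the mean of the train answers."""
    mean = sum(train_answers) / len(train_answers)
    return lambda x: mean


def mean_absolute_error(y_true, y_pred):
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)


score = evaluate_submission(
    fit_mean,
    train_features=[[0], [1]], train_answers=[1.0, 3.0],
    test_features=[[2], [3]], hidden_test_answers=[2.0, 4.0],
    metric=mean_absolute_error,
)
```

The author-supplied `fit` function never receives `hidden_test_answers`; only the resulting score is returned.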

Some examples of train/test splits:

  • random split: a train set of 437651 individuals and a test set of 48684 individuals, chosen at random from all people with SNP data available;
  • population split: a train set of White British individuals and a test set of all other individuals;
  • population exome split: the same as the population split, but restricted to individuals who have exome data.
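
The splits listed above could be sketched roughly as follows. This is an illustration under assumptions: the field names `id`, `population` and `has_exome`, the function names, and the 10% default test fraction are all hypothetical (the sizes quoted above correspond to roughly a 90/10 split, but the exact procedure is not specified here).

```python
import random


def random_split(ids, test_fraction=0.1, seed=0):
    """Shuffle individual IDs and hold out a fraction of them as the test set."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)


def population_split(individuals, train_population="White British",
                     require_exome=False):
    """Put one population in the train set and everyone else in the test set.

    With require_exome=True, individuals without exome data are skipped,
    which gives the population exome split.
    """
    train, test = [], []
    for person in individuals:
        if require_exome and not person["has_exome"]:
            continue
        if person["population"] == train_population:
            train.append(person["id"])
        else:
            test.append(person["id"])
    return train, test


# Hypothetical toy cohort.
cohort = [
    {"id": 1, "population": "White British", "has_exome": True},
    {"id": 2, "population": "White British", "has_exome": False},
    {"id": 3, "population": "Other", "has_exome": True},
    {"id": 4, "population": "Other", "has_exome": False},
]
train_ids, test_ids = random_split(range(100))
pop_train, pop_test = population_split(cohort)
exome_train, exome_test = population_split(cohort, require_exome=True)
```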

We solve two types of tasks: classification and regression.

Classification tasks are used to predict a binary phenotype, such as a disease or another individual feature that can be expressed as a binary (true/false or 1/0) value. For example:

  • self-reported psoriasis;
  • self-reported asthma;
  • self-reported prostate cancer.

Regression tasks are used to predict a continuous scalar phenotype, such as height or weight. For example:

  • height;
  • weight;
  • estimated heel-bone mineral density.

Several metrics are evaluated on the test set for these tasks. For example:

  • Log loss;
  • R squared;
  • Precision;
  • Recall;
  • F1;
  • ROC AUC;
  • RMSE.
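
For reference, the metrics listed above can be written out in plain Python using their standard textbook definitions. This is a sketch for illustration, not the implementation used by the system; the function names are hypothetical, and degenerate inputs (for example a test set with a single class, or a constant regression target) are not handled.

```python
import math


def log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)


def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for binary labels and binary predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for y, yh in pairs if y == 1 and yh == 1)
    fp = sum(1 for y, yh in pairs if y == 0 and yh == 1)
    fn = sum(1 for y, yh in pairs if y == 1 and yh == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


def roc_auc(y_true, scores):
    """Probability that a random positive scores above a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((y - yh) ** 2 for y, yh in zip(y_true, y_pred))
                     / len(y_true))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((y - yh) ** 2 for y, yh in zip(y_true, y_pred))
    ss_tot = sum((y - mean) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot
```

Log loss, precision, recall, F1 and ROC AUC apply to the classification tasks; RMSE and R squared apply to the regression tasks.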

An algorithm can use genetic and/or phenotype features to make predictions. For phenotype features, temporal restrictions apply.

For example, take an individual with a particular binary phenotype, such as a disease. That phenotype is diagnosed at a certain moment in time. After that moment, many consequences of the diagnosis can show up in the individual's other records and lead the algorithm to overfit or to learn shortcuts. To minimize this effect, all features of the individual should be taken as of the moment of diagnosis or an earlier moment. In practice, this means that phenotype features must be built relative to the target phenotype you are trying to predict.
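
The temporal restriction can be sketched as a filter over dated phenotype records. This is a minimal illustration under assumptions: the record layout `(individual_id, feature_name, value, recorded_on)`, the function name `features_before_cutoff`, and the choice to drop individuals without a cutoff date are all hypothetical. In particular, it assumes the caller supplies a reference date for every individual, including those who never received the diagnosis.

```python
from datetime import date


def features_before_cutoff(records, cutoffs):
    """Keep, per individual and feature, the latest value recorded on or before the cutoff.

    records: iterable of (individual_id, feature_name, value, recorded_on)
    cutoffs: dict individual_id -> date (e.g. the diagnosis date of the target)
    """
    latest = {}  # (individual_id, feature_name) -> (recorded_on, value)
    for individual_id, name, value, recorded_on in records:
        cutoff = cutoffs.get(individual_id)
        # Assumption: individuals without a cutoff are dropped entirely.
        if cutoff is None or recorded_on > cutoff:
            continue
        key = (individual_id, name)
        if key not in latest or recorded_on > latest[key][0]:
            latest[key] = (recorded_on, value)

    features = {}
    for (individual_id, name), (_, value) in latest.items():
        features.setdefault(individual_id, {})[name] = value
    return features


# Hypothetical records: the medication entry postdates the diagnosis and is excluded.
records = [
    ("A", "bmi", 27.0, date(2010, 3, 1)),
    ("A", "bmi", 29.5, date(2014, 9, 15)),
    ("A", "medication", "drug_x", date(2016, 1, 10)),
    ("B", "bmi", 22.0, date(2012, 5, 5)),
]
cutoffs = {"A": date(2015, 6, 1)}  # diagnosis date of the target phenotype
features = features_before_cutoff(records, cutoffs)
```

Here the medication record, which could reveal the diagnosis, never reaches the model, while the most recent pre-diagnosis BMI value is kept.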