Artificial intelligence

Artificial intelligence specifically designed for the analysis of biomedical data

Marco F. Schmidt GmbH c/o TechCode, Jägerallee 16, 14469 Potsdam, Germany

With long, cost-intensive clinical trials and, above all, a high number of unsuccessful ones (especially in phases II and III), drug development is on the verge of losing its profitability.[1] Clinical trials mainly fail because a) the addressed drug target turns out not to be central to the disease mechanism and b) the wrong patients are selected for the trials.[2]

One way to improve success rates is through genetic biomarkers gained from the analysis of existing biomedical data for drug target linkage and, more importantly, for patient stratification.[3] However, biomedical data are difficult to analyze because they combine small sample sizes with many features (so-called high dimensionality).

Most current genetic biomarkers rely on single gene variants and are thus limited in their predictive power. Because many diseases arise from complex interactions of multiple gene variants, it seems reasonable to strive towards multigene, so-called polygenic, biomarkers. The problematic nature of biomedical data (high dimensionality: small sample size, many features), however, makes the discovery of polygenic markers particularly difficult.
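To illustrate why a single-variant marker can miss an interaction effect, here is a minimal toy sketch. It assumes a purely hypothetical disease rule (status is determined by an exclusive-or of two binary variants) and a hypothetical balanced cohort; the names `cohort` and `accuracy` are invented for this illustration and do not describe the platform or any real genetic mechanism.

```python
from itertools import product

# Hypothetical toy cohort: each combination of two binary variants (a, b)
# appears 25 times. Disease status y follows an assumed XOR rule, chosen
# only because it is the simplest pure interaction effect.
cohort = [(a, b, a ^ b) for a, b in product((0, 1), repeat=2) for _ in range(25)]


def accuracy(predict, data):
    """Fraction of patients whose disease status is predicted correctly."""
    correct = sum(1 for a, b, y in data if predict(a, b) == y)
    return correct / len(data)


# Single-variant "biomarkers": predict disease from one variant alone.
single_a = accuracy(lambda a, b: a, cohort)
single_b = accuracy(lambda a, b: b, cohort)

# Two-variant "biomarker": uses the interaction between both variants.
pair = accuracy(lambda a, b: a ^ b, cohort)

print(f"variant A alone: {single_a:.0%}")
print(f"variant B alone: {single_b:.0%}")
print(f"A and B jointly: {pair:.0%}")
```

In this constructed case each variant alone performs at chance level, while the pair predicts perfectly. Real diseases are far less clear-cut, but the example shows why testing variants one at a time can overlook signals that only appear in combination.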

Machine learning

Fig. 1. Our machine learning identifies complex interactions in biomedical data, yielding Next-Generation Biomarkers with outstanding accuracy and sensitivity rates compared to existing single-feature biomarkers.

We developed a machine learning approach specifically designed for the discovery of complex interactions in biomedical and genomic data. In contrast to standard biomarker discovery technologies, our machine learning platform was designed to reliably find complex interactions in high-dimensional data (see Fig. 1). This enables the discovery of polygenic Next-Generation Biomarkers (NGBs) that are much more powerful than their current single-gene counterparts. For complex diseases such as Late-Onset Alzheimer’s Disease (LOAD), the currently used single-gene (monogenic) biomarker APOE4 displays an accuracy rate below 60% (50% being random). In contrast, our LOAD Next-Generation Biomarker makes use of polygenic interactions and has an accuracy of 85%.

By using our Next-Generation Biomarkers with their superior accuracy and sensitivity rates in prediction, the right patients can be selected for phase II and III clinical trials, making the drug development process faster, less expensive and, most importantly, more successful.

As mentioned, biomedical data show the problematic structure of small patient numbers, often fewer than 500, in relation to the examined characteristics, often exceeding 1 million. The analysis of such high-dimensional data, especially when testing for interactions, is prone to a high rate of false positives due to the multiple comparison problem in statistics: if many data series are compared, convincing but purely coincidental results are likely to be obtained.
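The scale of the multiple comparison problem can be made concrete with a short calculation. The sketch below assumes independent tests and a conventional significance level of 0.05, and it uses the Bonferroni correction only as a familiar textbook reference point; none of this is a description of how the platform itself handles the problem, and the function names are hypothetical.

```python
from math import comb


def expected_false_positives(n_tests: int, alpha: float = 0.05) -> float:
    """Expected number of false positives if no tested effect is real."""
    return n_tests * alpha


def familywise_error_rate(n_tests: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive among independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(n_tests: int, alpha: float = 0.05) -> float:
    """Per-test significance level under the standard Bonferroni correction."""
    return alpha / n_tests


n_features = 1_000_000         # order of magnitude quoted in the text
n_pairs = comb(n_features, 2)  # number of pairwise interaction tests

print(f"pairwise interaction tests: {n_pairs:,}")
print(f"expected false positives (single features): "
      f"{expected_false_positives(n_features):,.0f}")
print(f"chance of >= 1 false positive in only 20 tests: "
      f"{familywise_error_rate(20):.1%}")
print(f"Bonferroni threshold for all pairs: {bonferroni_threshold(n_pairs):.1e}")
```

With fewer than 500 patients, a per-test threshold this strict leaves very little statistical power to detect real interactions, which is why a naive correction alone is usually not enough for this kind of data.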

The uniqueness of our machine learning platform is its ability to mitigate the multiple testing problem, i.e. to keep the real effects but exclude the ‘false alarms’. It does so by leveraging structural and mathematical properties of biomedical data as well as contextual information mined from biomedical journals and databases. The approach works: each NGB has been validated on up to 5 completely independent datasets (out-of-sample prediction) of the same disease using standard machine learning metrics such as accuracy, precision and recall (recall is often referred to as sensitivity in the medical field).

The use of these metrics (see explanation and example in Tab. 1) also enables comparing our Next-Generation Biomarkers with the markers of other companies that claim to use AI to generate better biomarkers. We can demonstrate that our Next-Generation Biomarkers objectively outperform those of our competitors.

Tab. 1. The prediction quality of a biomarker can be determined by precision and recall analysis following the equations in the last two lines.

                    Predicted negative      Predicted positive
Actual negative     True Negative (TN)      False Positive (FP)
Actual positive     False Negative (FN)     True Positive (TP)

Precision = TP / (FP + TP)
Example LOAD biomarker: 85% compared to 60% of APOE4

Recall (Sensitivity) = TP / (FN + TP)
Example LOAD biomarker: 57% compared to 12% of APOE4
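The equations in Tab. 1 can be sketched in a few lines of code. The counts below are purely illustrative and are not taken from the LOAD study or any other dataset; the class name `ConfusionMatrix` and its fields are hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class ConfusionMatrix:
    """Counts of a binary biomarker prediction against the true disease status."""
    tp: int  # true positives: diseased, predicted diseased
    fp: int  # false positives: healthy, predicted diseased
    fn: int  # false negatives: diseased, predicted healthy
    tn: int  # true negatives: healthy, predicted healthy

    def precision(self) -> float:
        """Precision = TP / (FP + TP)."""
        return self.tp / (self.fp + self.tp)

    def recall(self) -> float:
        """Recall (sensitivity) = TP / (FN + TP)."""
        return self.tp / (self.fn + self.tp)

    def accuracy(self) -> float:
        """Accuracy = (TP + TN) / all predictions."""
        return (self.tp + self.tn) / (self.tp + self.fp + self.fn + self.tn)


# Illustrative counts only (assumed for this example, not real results).
example = ConfusionMatrix(tp=40, fp=10, fn=20, tn=130)

print(f"precision: {example.precision():.2f}")
print(f"recall:    {example.recall():.2f}")
print(f"accuracy:  {example.accuracy():.2f}")
```

Note that the three metrics answer different questions and can differ noticeably for the same predictions, so comparisons between biomarkers should always use the same metric on the same data.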

In conclusion, we identified gene interactions in data from genome-wide association studies (GWAS), which resulted in a mathematical model that predicts Late-Onset Alzheimer's Disease with an accuracy of 85%. In other words, our approach predicts the disease status of more than 8 out of 10 patients correctly, whereas the standard APOE4 test predicts fewer than 6 out of 10 correctly. In the last 12 months we analyzed several datasets, from customers, from our own sources, or from academic collaborations, resulting in a portfolio of seven NGBs for various diseases. Each one offers an improvement over the corresponding monogenic biomarker currently in clinical use.


1. Deloitte, A new future for R&D? Measuring the return from pharmaceutical innovation, 2017.
2. Cook et al., Nat Rev Drug Discov, 2014, 13, 419-431.
3. Nelson MR et al., Nat Genetics, 2015, 47, 856-860.
