Statistical Relational Learning to Predict Primary Myocardial Infarction from Electronic Health Records

Jeremy C. Weiss, Sriraam Natarajan, Peggy L. Peissig, Catherine McCarty, and David Page, IAAI 2012

Myocardial infarctions (MIs), commonly known as heart attacks, cause one in three deaths in the United States, yet the health trajectory leading to an MI remains poorly understood. Predicting future MIs is a challenging task, and extensive studies have therefore tried to identify and quantify the risk factors that contribute to them. Common risk factors identified so far include age, gender, blood pressure, low-density lipoprotein (LDL) cholesterol, diabetes, obesity, inactivity, alcohol, and smoking. The canonical study designs in this field (case-control studies, cohort studies, and randomized controlled trials) have concentrated on one risk factor at a time. The natural way forward is to analyze the effects of multiple factors together, which raises the question of whether machine learning can do so. The paper also uses Electronic Health Records (EHRs) to overcome a limitation of the data collected in previous studies.

The limitation of the previous data is that in these studies the risk factors are established at the onset of the study (time t0), when the data is collected; annual check-ups are then conducted to assess patient health and determine whether an MI event has occurred. Patients who did not possess a risk factor at t0 but developed it later were treated as not possessing that risk factor throughout the analysis. EHRs, in contrast, record how risk factors develop, because they track the health trajectories of patients through time, and this gives them a unique advantage for risk modelling. They can help create a risk profile similar to the Framingham Risk Score (FRS) without medical interventions such as additional laboratory tests. In addition, the FRS works better for Caucasians than for other populations, so it is biased. EHR-based risk profiles could therefore be a more reliable score than the FRS.

This paper approaches the prediction and risk stratification of MI from EHRs using two Statistical Relational Learning (SRL) methods: Relational Probability Trees (RPTs) and Relational Functional Gradient Boosting (RFGB). RPTs upgrade decision trees to the relational setting, and RFGB upgrades Functional Gradient Boosting (FGB) to the relational setting. For RPTs, the paper uses the Tilde relational regression tree (RRT) learner to learn a tree for the positive examples whose leaves hold regression values giving the probability of MI occurrence. Inner nodes in this tree represent conjunctions of literals (at most two literals). FGB fits a regression tree to the training examples at each gradient step; RFGB replaces these propositional regression trees with relational regression trees.
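The boosting loop described above can be sketched as follows. This is a minimal propositional illustration, not the paper's implementation: it assumes binary features, uses depth-1 regression stumps that test a single feature (the paper's relational trees test conjunctions of literals over relational data), and takes the pointwise gradient to be the label minus the current predicted probability. All function names and the toy data are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(rows, targets):
    """Fit a depth-1 regression tree: choose the binary feature whose split
    gives the lowest squared error, and store the mean target on each side."""
    best = None
    for j in range(len(rows[0])):
        on = [t for r, t in zip(rows, targets) if r[j] == 1]
        off = [t for r, t in zip(rows, targets) if r[j] == 0]
        mean_on = sum(on) / len(on) if on else 0.0
        mean_off = sum(off) / len(off) if off else 0.0
        sse = (sum((t - mean_on) ** 2 for t in on)
               + sum((t - mean_off) ** 2 for t in off))
        if best is None or sse < best[0]:
            best = (sse, j, mean_on, mean_off)
    _, j, mean_on, mean_off = best
    return j, mean_on, mean_off


def stump_value(stump, row):
    j, mean_on, mean_off = stump
    return mean_on if row[j] == 1 else mean_off


def boost(rows, labels, n_steps=20):
    """Functional gradient boosting: at each step, fit a regression stump to
    the pointwise gradients (label minus predicted probability) and add it
    to the additive model."""
    stumps = []
    for _ in range(n_steps):
        psi = [sum(stump_value(s, r) for s in stumps) for r in rows]
        gradients = [y - sigmoid(p) for y, p in zip(labels, psi)]
        stumps.append(fit_stump(rows, gradients))
    return stumps


def predict_proba(stumps, row):
    """Probability of the positive class under the summed stump values."""
    return sigmoid(sum(stump_value(s, row) for s in stumps))


# Toy data (hypothetical): columns are [has_diabetes, smoker]; label 1 = MI.
rows = [[1, 1], [1, 0], [0, 1], [0, 0]]
labels = [1, 1, 0, 0]
stumps = boost(rows, labels)
print(round(predict_proba(stumps, [1, 0]), 2))
```

In the relational version, `fit_stump` would be replaced by a relational regression tree learner, while the outer loop keeps the same shape.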

Experiments were performed on de-identified EHR data from 18,386 subjects. A total of 1,528 binary features were chosen a priori from relational tables for diagnoses, medications, labs, procedures, vitals, and demographics. These included major risk factors, common risk factors, drugs, and patient relations. Features were also discretized (e.g. for blood pressure, five binary features were created by mapping the real value to critically high, high, normal, low, and critically low). The training data was split so that positive and negative examples appear in a 1:1 ratio. The paper compares RFGB, RPTs, Boosted Decision Trees, Decision Trees, Naive Bayes, Tree Augmented Naive Bayes, Support Vector Machines (SVMs) with a linear kernel, SVMs with a radial basis function kernel, and Random Forests. The results show that the accuracies of all the algorithms are comparable. However, because avoiding false negatives matters more than avoiding false positives in the medical domain, precision at high recall is a much more significant measure than accuracy. The experiments show that RFGB performs considerably better than the other approaches on precision at high recall.
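Two steps from this setup, the discretization of a real-valued measurement into binary features and the precision-at-high-recall measure, can be sketched as below. This is an illustration under stated assumptions: the cutoffs in `BP_CUTOFFS` are placeholder numbers chosen for the example (not the paper's thresholds and not clinical guidance), and `precision_at_recall` is one reasonable reading of the metric, namely the best precision over all score thresholds that reach the required recall.

```python
# Hypothetical systolic blood pressure bins; cutoffs are illustrative only.
BP_CUTOFFS = [70, 90, 140, 180]
BP_NAMES = ["critically_low", "low", "normal", "high", "critically_high"]


def discretize(value, cutoffs, names):
    """Map a real value to one binary feature per named bin.
    `cutoffs` are ascending lower bounds of every bin except the first."""
    index = sum(value >= c for c in cutoffs)
    return {name: int(i == index) for i, name in enumerate(names)}


def precision_at_recall(scores, labels, min_recall):
    """Best precision over all score thresholds whose recall >= min_recall."""
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("need at least one positive example")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    best = 0.0
    true_pos = 0
    for k, (score, label) in enumerate(ranked, start=1):
        true_pos += label
        # A threshold cannot separate tied scores, so only evaluate
        # at the end of each run of equal scores.
        if k < len(ranked) and ranked[k][0] == score:
            continue
        if true_pos / total_pos >= min_recall:
            best = max(best, true_pos / k)
    return best


features = discretize(150, BP_CUTOFFS, BP_NAMES)
print(features)

# Toy scores and labels (hypothetical): label 1 = patient had an MI.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
toy_labels = [1, 0, 1, 1, 0, 0]
print(precision_at_recall(toy_scores, toy_labels, min_recall=1.0))
```

Requiring a high `min_recall` forces the threshold low enough to catch nearly all true MI cases, so the measure rewards models that keep false positives down under that constraint.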

Key contributions

  1. The paper approaches the problem of predicting MIs in real patients and identifies ways in which machine learning can contribute to clinical studies.
  2. It establishes that each relational learner can outperform its propositional variant, even in large-scale domains such as EHRs, and provide interpretable results.
  3. It introduces the task of MI prediction to the SRL community.


Limitations

  1. Features were chosen a priori, even for the algorithms that perform feature selection, both for computational reasons and to allow comparison with algorithms that do not perform feature selection.
  2. Features were discretized rather than used in their natural form.
  3. Relational information such as the hierarchies present in the EHR for diagnoses, drugs, and laboratory values is not considered in these experiments.


The paper highlights a very important limitation of the data collected in clinical trials and underscores the importance of EHRs. The interpretability achieved by these models is very important, especially in the medical domain.

The paper claims as its key contribution that it addresses the challenging problem of predicting MI in real patients and identifies ways in which machine learning can augment current methodologies in clinical studies. Although this is the first paper to use SRL methods to predict MI in real patients, it is not the first to use machine learning to predict or detect MI: "Kernel-based Support Vector Machine classifiers for early detection of myocardial infarction" by D. Conforti and R. Guido predates it. It is unclear to me whether prediction and early detection are synonymous, but I would have liked to see previous ML work discussed in a related work section.