## Thesis etd-03252024-145940


Thesis type

Master's thesis (tesi di laurea magistrale)

Author

ORLANDINI, FRANCESCA

URN

etd-03252024-145940

Thesis title

Explainable electronic health records survival analysis with a neural network approach

Department

INGEGNERIA DELL'INFORMAZIONE (Information Engineering)

Course of study

INGEGNERIA BIOMEDICA (Biomedical Engineering)

Supervisors

**Supervisor** Prof. Vozzi, Giovanni

**Supervisor** Prof. Positano, Vincenzo

**Supervisor** Ing. De Santi, Lisa Anita

Keywords

- Cox regression
- deep learning
- eXplainable Artificial Intelligence
- neural networks
- PyTorch
- survival analysis
- Thalassemia Major

Graduation session start date

18/04/2024

Availability

Full

Summary

Survival analysis encompasses different methods used to model the time until a patient experiences an event, such as death. The classical techniques used for this task are the non-parametric Kaplan-Meier estimator and the semi-parametric Cox proportional hazards regression model.

The Cox proportional hazards model returns the hazard at time t for a subject with covariates x1, …, xp. The hazard can be expressed as the product of a term h0(t), called the baseline hazard function, and a term called the partial hazard, which is the exponential of a linear combination of the covariates: h(t|x) = h0(t) · exp(β1x1 + … + βpxp). The parameters β1, …, βp describe the effect of each covariate on the hazard.

Once the hazard h(t|X) has been obtained, the survival curve S(t) can be found as the exponential of the negative cumulative hazard: S(t|X) = exp(−H(t|X)).
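As a minimal numerical sketch of these quantities (the covariate values, coefficients, and constant baseline hazard below are made up for illustration, not taken from the thesis):

```python
import numpy as np

# Hypothetical covariates x1..xp for one subject and Cox coefficients beta1..betap.
x = np.array([1.0, 0.5, -0.3])
beta = np.array([0.8, -0.2, 0.1])

log_partial_hazard = x @ beta            # linear combination of the covariates
partial_hazard = np.exp(log_partial_hazard)

# Assume a constant baseline hazard h0(t) = 0.01, so H0(t) = 0.01 * t.
t = np.linspace(0, 10, 101)
H0 = 0.01 * t
H = H0 * partial_hazard                  # cumulative hazard H(t|x)
S = np.exp(-H)                           # survival curve S(t|x) = exp(-H(t|x))
```

The survival curve starts at 1 and decreases monotonically, as expected for any valid hazard.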

These statistical methods can predict not only the probability of a particular event but also the time to the event. This has great potential clinical utility, since it can help clinicians triage patients quickly.

However, these models have several limitations: they assume a linear relationship between the covariates and the log-hazard, and they are not well suited to high-dimensional, complex data. For this reason, deep learning has become increasingly popular for survival analysis in recent years, particularly in healthcare research.

In the present work, we applied the Cox neural network (Cox-net) to overcome the limits of the classical Cox regression.

We trained the model on data from patients with Thalassemia Major (TM), collected during a clinical trial organized by the Myocardial Iron Overload in Thalassemia (MIOT) project network. The MIOT dataset has the characteristics of a typical survival-analysis database: it contains both continuous and categorical covariates, along with a censoring indicator and a time variable.

The censoring indicator is a binary variable describing whether the patient is censored, i.e. whether the event of interest has not occurred; the time variable is a continuous variable representing the time at which the event occurred or, for patients who did not have the event, the last time the patient was observed.

The collected data consist of 92% censored patients.
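As an illustration of this data layout (hypothetical records, not MIOT data), each row pairs the censoring indicator with the time variable:

```python
# Hypothetical survival-analysis records:
# "event" is the censoring indicator (1 = event observed, 0 = censored),
# "time" is the event time or, for censored patients, the last follow-up time.
patients = [
    {"event": 1, "time": 4.2},   # event observed at t = 4.2
    {"event": 0, "time": 7.5},   # censored: last seen at t = 7.5
    {"event": 0, "time": 9.1},   # censored
    {"event": 1, "time": 2.8},   # event observed at t = 2.8
]

# Fraction of censored patients (92% in the real MIOT dataset).
censored_fraction = sum(1 - p["event"] for p in patients) / len(patients)
```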

Moreover, eight synthetic populations with different percentages of censored patients (from 0% to 98%) were created in R to understand whether Cox-net training and convergence are influenced by the number of censored patients. The datasets were made as similar as possible to the MIOT population: we set the coefficients β equal to those from the Cox regression and gave each variable the same distribution as the corresponding covariate in the real dataset.

Before training, the variables of all the datasets were standardized to avoid the exploding-gradient problem.
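Standardization here means z-scoring: each feature column is rescaled to zero mean and unit variance, which keeps gradient magnitudes comparable across features. A minimal sketch:

```python
import numpy as np

def standardize(X):
    """Z-score each column: subtract the column mean, divide by the column std."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    return (X - mean) / std

# Toy feature matrix with two features on very different scales.
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])
Xs = standardize(X)
```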

For training, we applied stratified 3-fold cross-validation to manage the dataset imbalance, and early stopping to prevent overfitting.
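With a heavily censored dataset, stratifying the folds on the event indicator keeps the censored/uncensored ratio similar in every fold. A sketch with scikit-learn (a common choice; the thesis does not specify the implementation, and the data here are random):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(90, 13))            # 90 hypothetical patients, 13 features
event = np.array([1] * 9 + [0] * 81)     # 90% censored, as in an imbalanced dataset

# Stratify on the censoring indicator so each fold preserves the event ratio.
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
folds = list(skf.split(X, event))
```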

We evaluated the model using the concordance index (c-index), the most common metric used in survival analysis.
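The c-index counts, among comparable pairs (where the earlier of the two times is an observed event), the fraction whose predicted risks are ordered consistently with their event times. A minimal from-scratch sketch (lifelines and other libraries provide production implementations):

```python
def concordance_index(times, events, risks):
    """Fraction of comparable pairs where the higher-risk subject has the
    earlier observed event; ties in predicted risk count as 0.5."""
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue                     # a pair is comparable only if the
        for j in range(n):               # earlier time is an observed event
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    return concordant / comparable

# Perfect ordering: the higher the risk, the earlier the event -> c-index = 1.
c = concordance_index([1, 2, 3], [1, 1, 1], [3.0, 2.0, 1.0])
```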

We chose the hyperparameter settings to optimize prediction performance on the validation set. SGD with Nesterov momentum was used as the optimizer, and ReLU as the activation function. We chose a learning rate of 0.00001, a dropout rate of 0.4, and a momentum of 0.9; the ridge regularization parameter λ was set to 0.0005.

The final model has three layers: the first two are classic fully connected layers with 256 nodes each, while the output layer is a fully connected layer without a bias term, consisting of a single node.

The input layer has 13 nodes, matching the number of features per patient. The negative log partial likelihood with L2 regularization was used as the loss, so that the network outputs the log partial hazard of the Cox regression.
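A minimal PyTorch sketch of the architecture and loss described above (layer sizes and hyperparameters taken from the text; the exact thesis implementation may differ, and this version does not handle tied event times):

```python
import torch
import torch.nn as nn

# 13 input features -> 256 -> 256 -> 1 (no bias on the output layer);
# the single output is the log partial hazard of the Cox model.
cox_net = nn.Sequential(
    nn.Linear(13, 256), nn.ReLU(), nn.Dropout(0.4),
    nn.Linear(256, 256), nn.ReLU(), nn.Dropout(0.4),
    nn.Linear(256, 1, bias=False),
)

def neg_log_partial_likelihood(log_h, times, events):
    """Cox negative log partial likelihood, averaged over observed events.
    log_h: (n,) log partial hazards; times: (n,) event/censoring times;
    events: (n,) 1 if the event was observed, 0 if censored."""
    order = torch.argsort(times, descending=True)   # sort by decreasing time
    log_h, events = log_h[order], events[order]
    # Cumulative logsumexp: entry i sums over the risk set {j : t_j >= t_i}.
    log_risk = torch.logcumsumexp(log_h, dim=0)
    return -((log_h - log_risk) * events).sum() / events.sum()

# SGD with Nesterov momentum; weight_decay implements the L2 (ridge) penalty.
optimizer = torch.optim.SGD(cox_net.parameters(), lr=1e-5,
                            momentum=0.9, nesterov=True, weight_decay=5e-4)
```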

Then, we created a synthetic Gaussian dataset to demonstrate that the Cox-net, unlike classical Cox regression, can capture non-linear relationships between covariates.

Afterwards, we applied classical Cox regression and the Kaplan-Meier estimator, using the Python library lifelines, as baselines to assess the coherence of the Cox-net log partial hazards with those resulting from the Cox regression. We then produced the survival curves S(t).
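The thesis uses lifelines for these baselines; purely as a self-contained illustration of what the Kaplan-Meier product-limit estimate computes, S(t) = ∏ (1 − d_i/n_i) over the distinct event times t_i, with hypothetical data:

```python
def kaplan_meier(times, events):
    """At each distinct event time t, multiply the running survival estimate
    by (1 - deaths_at_t / number_at_risk_just_before_t)."""
    data = sorted(zip(times, events))
    n = len(data)
    surv = 1.0
    curve = []                           # (time, S(t)) at each event time
    i = 0
    while i < n:
        t = data[i][0]
        deaths = sum(1 for tt, e in data if tt == t and e)
        at_risk = n - i                  # subjects with time >= t
        while i < n and data[i][0] == t:
            i += 1                       # advance past all subjects at time t
        if deaths:
            surv *= 1 - deaths / at_risk
            curve.append((t, surv))
    return curve

# Hypothetical cohort: events at t=1 and t=3, censoring at t=2 and t=4.
curve = kaplan_meier([1, 2, 3, 4], [1, 0, 1, 0])
```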

Moreover, we analyzed which patient features most influence the Cox-net output.

Since the black-box nature of the Cox-net does not allow us to understand how and why it produced a given log partial hazard, we applied two eXplainable Artificial Intelligence (XAI) algorithms: Permutation Feature Importance and SHapley Additive exPlanations (SHAP).
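Permutation Feature Importance can be sketched from scratch: shuffle one feature column at a time and measure how much a performance metric drops. The toy model and metric below are hypothetical stand-ins (the thesis applies the idea to the Cox-net with the c-index):

```python
import numpy as np

def permutation_importance(model, X, y, metric, n_repeats=10, seed=0):
    """Mean drop in the metric when each column of X is shuffled in turn."""
    rng = np.random.default_rng(seed)
    baseline = metric(y, model(X))
    importances = np.zeros(X.shape[1])
    for col in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            Xp = X.copy()
            rng.shuffle(Xp[:, col])           # break this feature's link to y
            drops.append(baseline - metric(y, model(Xp)))
        importances[col] = np.mean(drops)
    return importances

# Toy setup: the target depends only on the first of three features.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = X[:, 0]
model = lambda X: X[:, 0]                     # a toy predictor using feature 0 only
metric = lambda y, pred: -np.mean((y - pred) ** 2)   # negative MSE (higher = better)
imp = permutation_importance(model, X, y, metric)
```

Only the first feature matters to this toy model, so shuffling it hurts the metric while shuffling the others has no effect at all.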

Finally, starting from an existing CNN that classifies myocardial iron overload levels from multi-echo T2* images, we developed a model that extracts the features of its last layer.

Our results show that, on the MIOT dataset, the c-index of the Cox-net (0.812 ± 0.036) is slightly higher than that of classical Cox regression (0.790 ± 0.040) computed on the same validation set.

Moreover, since the c-index is not stable across the three training runs (SD = 0.036), we used the synthetic datasets to demonstrate that this instability derives from the high number of censored patients: the standard deviation increased from 0.009 for the dataset without censored patients to 0.054 for the dataset with 98% censored patients.

Additionally, we noticed that the Cox-net is less sensitive than the Cox regression to the number of censored patients: its average c-index remains above 0.8 for all the datasets, while the average c-index of the Cox regression drops below 0.8 already when censored patients make up 35% of the total.

Considering the ideal dataset without censored patients, we compared the Cox-net survival curve with the Kaplan-Meier survival curve on the same test set. In this case, the curve obtained with the Kaplan-Meier estimator represents the ground truth. Since the curve obtained from the risks output by the Cox-net is very close to it, we conclude that the hazard risks predicted by the neural network are close to the true risks.

Both Permutation Feature Importance and SHAP show that, for all the datasets, the most important predictor variables are first fibrosis and then sex. These results agree with those obtained from classical Cox regression and are also clinically meaningful, since male patients and patients with fibrosis are well known to have a higher risk of developing cardiac disease.

We therefore obtained two survival curves for each dataset: one for patients with fibrosis and one for patients without. Both the Cox-net and the Cox regression clearly separate the two groups.

Additionally, the synthetic-dataset results show that the two curves become less distinguishable as the number of censored patients grows.

Finally, we repeated the same procedure for sex: male patients can be distinguished from female patients. Here too, the two curves are less separated when the number of censored patients is higher.

The work carried out in this master's thesis opens the way for future developments. An advantage of having built a neural network model is that image features can be provided as input in addition to clinical data.

Indeed, in contrast to classical Cox regression, the Cox-net can handle large amounts of data. An important next step would be to verify whether adding image features leads to a substantial improvement in Cox-net performance.


File

| File name | Size |
|---|---|
| TesiFran...ndini.pdf | 7.69 MB |
