22–23 May 2026
Sibiu, Romania
Europe/Bucharest timezone

ETL procedures and acceptable types of analysis on a cardiovascular disease dataset

22 May 2026, 15:10
10m
ARSENAL Room (Mercure Sibiu Arsenal)

Speaker

Julian Vasilev (University of Economics Varna)

Description

Introduction: Cardiovascular diseases remain a leading cause of mortality worldwide, which makes early risk identification a major clinical and public health priority. Reliable risk estimation depends not only on the selected model, but also on the quality and consistency of the underlying data.
Purpose: This paper examines the ETL procedures needed to prepare the Framingham cardiovascular dataset for subsequent statistical analysis and risk assessment.
Methods: The study is based on a publicly available cardiovascular dataset containing demographic, anthropometric, biochemical, and behavioural variables. Considering the available fields, the Framingham Risk Score was selected as the most suitable reference model for further analysis. The ETL workflow consists of identifying and treating missing values, standardising measurement units, reconstructing incomplete variables from related fields, deriving additional indicators, and converting categorical variables into numerical formats suitable for analysis. In addition to data preparation, the study outlines a set of acceptable preliminary analyses, including distributional checks, screening for potential outliers, correlation analysis, and tests of association between selected categorical risk factors and estimated cardiovascular risk.
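The ETL steps listed under Methods can be illustrated with a minimal sketch. It is not the authors' implementation: the field names (chol, chol_unit, sys_bp, dia_bp, weight_kg, height_cm, bmi, sex, smoker), the choice of median imputation, and the derived pulse-pressure indicator are all assumptions made for illustration, since the abstract does not name the dataset's columns or the exact treatment applied.

```python
from statistics import median

# Approximate factor for total cholesterol, mmol/L -> mg/dL.
MMOL_TO_MGDL_CHOLESTEROL = 38.67
# Hypothetical category codings (not taken from the abstract).
SEX_CODES = {"M": 1, "F": 0}
SMOKER_CODES = {"yes": 1, "no": 0}


def run_etl(rows):
    """Return cleaned copies of `rows` (a list of dicts); the input is not modified."""
    cleaned = [dict(r) for r in rows]

    # 1. Standardise measurement units: cholesterol to mg/dL.
    for r in cleaned:
        if r.get("chol") is not None and r.get("chol_unit") == "mmol/L":
            r["chol"] = round(r["chol"] * MMOL_TO_MGDL_CHOLESTEROL, 1)
        r["chol_unit"] = "mg/dL"

    # 2. Reconstruct an incomplete variable from related fields: BMI = kg / m^2.
    for r in cleaned:
        if r.get("bmi") is None and r.get("weight_kg") and r.get("height_cm"):
            r["bmi"] = round(r["weight_kg"] / (r["height_cm"] / 100) ** 2, 1)

    # 3. Treat remaining missing values (assumed strategy: median imputation).
    for col in ("chol", "sys_bp", "bmi"):
        observed = [r[col] for r in cleaned if r.get(col) is not None]
        if observed:
            fill = median(observed)
            for r in cleaned:
                if r.get(col) is None:
                    r[col] = fill

    # 4. Derive an additional indicator and encode categorical variables.
    for r in cleaned:
        r["pulse_pressure"] = r["sys_bp"] - r["dia_bp"]
        r["sex_male"] = SEX_CODES.get(r.get("sex"))
        r["smoker_flag"] = SMOKER_CODES.get(r.get("smoker"))

    return cleaned


# Made-up records, used only to exercise the sketch.
sample = [
    {"sex": "M", "smoker": "yes", "chol": 5.2, "chol_unit": "mmol/L",
     "sys_bp": 130, "dia_bp": 85, "weight_kg": 80.0, "height_cm": 180.0, "bmi": None},
    {"sex": "F", "smoker": "no", "chol": 240.0, "chol_unit": "mg/dL",
     "sys_bp": None, "dia_bp": 80, "weight_kg": None, "height_cm": None, "bmi": 27.5},
    {"sex": "F", "smoker": "no", "chol": None, "chol_unit": "mg/dL",
     "sys_bp": 150, "dia_bp": 95, "weight_kg": 65.0, "height_cm": 160.0, "bmi": None},
]

cleaned = run_etl(sample)
for row in cleaned:
    print(row)
```

The order of the steps matters in this sketch: units are harmonised and BMI is reconstructed before imputation, so that the median is computed on comparable values and is only used where no related field can supply the missing value.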
Practical implications: The initial exploratory analyses suggest that the transformed data behave in a clinically plausible way and that the main variables are suitable for subsequent statistical use. Early results also suggest that the prepared dataset captures the expected patterns between established cardiovascular risk factors and estimated risk levels, while remaining appropriate for further modelling and validation.
Findings: The study supports the view that ETL is not merely a technical preprocessing step, but a methodological prerequisite for valid secondary analysis of clinical data. The proposed workflow demonstrates how careful transformation and validation can improve the usability of real-world health data and define which analytical procedures are appropriate at the preliminary stage. The resulting dataset is intended to serve as a foundation for future predictive modelling, comparative evaluation of risk estimation approaches, and possible adaptation of the workflow to other clinical datasets.

Primary authors

Julian Vasilev (University of Economics Varna)
Mrs Petya Penkova (University of Economics Varna)
