Prediction of the 10-Year Risk of Coronary Heart Disease
Using the Framingham Heart Study Data
Technology for Humanity
Chapter 2. FHS Data
Step 1. Prerequisite
This application was developed with Django (4.2.0) as the backend server and HTML and JavaScript as the frontend. To replicate it, all required libraries need to be installed.
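One way to confirm the prerequisites before running the later steps is to query the installed package metadata. The sketch below assumes the distribution names Django, pandas, seaborn, and tabulate (the packages named or imported in this chapter); `installed_versions` is a hypothetical helper for illustration and is not part of the application.

```python
from importlib import metadata

# Assumed distribution names: the backend from Step 1 plus the
# libraries imported in Step 2.
REQUIRED = ["Django", "pandas", "seaborn", "tabulate"]


def installed_versions(names):
    """Return {package name: version string, or None if not installed}."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(REQUIRED).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

Any package reported as missing can then be installed with the usual package manager before continuing.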
Step 2. Import libraries (or packages)
First, import libraries.
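import os

import pandas as pd
import seaborn as sns
from tabulate import tabulate

from main import settings  # settings module of this Django project

# Work from the project root so that relative data paths resolve
os.chdir(settings.BASE_DIR)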
Step 3. Import data (in CSV format)
The original data can be downloaded from "https://www.kaggle.com/datasets/aasheesh200/framingham-heart-study-dataset". The data was saved in a local directory.
df = pd.read_csv('./data/framingham/dataset/framingham.csv')
The CSV file was imported into a Pandas DataFrame. Pandas is a Python library for data manipulation and numerical analysis. It can read and write many file formats, so its portability is excellent. However, unlike some statistical packages (e.g., SPSS, R, Stata), Pandas does not store metadata such as variable labels and value labels with the dataset. Pandas is appropriate for relatively small datasets (roughly n < one million records). When the data are much larger (i.e., n > one million), other tools and formats (e.g., Polars, Arrow, Parquet) are preferred.
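Since the DataFrame itself carries no variable descriptions, one possible workaround is to keep a small hand-written codebook next to it. The following is a minimal sketch on toy rows, not the real file; the names `df_demo`, `labels`, and `codebook` are hypothetical, and the descriptions are taken from Table 1.

```python
import pandas as pd

# Toy rows for illustration only; the real data come from framingham.csv.
df_demo = pd.DataFrame({"male": [1, 0], "age": [39, 46], "TenYearCHD": [0, 0]})

# Hand-written codebook: variable name -> description (from Table 1).
labels = {
    "male": "Subject's sex (1: Yes, 0: No)",
    "age": "Age of the patient in years",
    "TenYearCHD": "10-year risk of coronary heart disease (1: Yes, 0: No)",
}

# Combine the stored dtype and the description, one row per variable.
codebook = pd.DataFrame({
    "dtype": df_demo.dtypes.astype(str),
    "description": pd.Series(labels),
})
print(codebook)
```

The codebook is an ordinary DataFrame, so it can be saved alongside the data in any format Pandas can write.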
Table 1. Data description
Category | Variable | Code | Description |
Demographic | male | 1: Yes, 0: No | Subject's sex |
 | age | 32 - 70 years | Age of the patient (Continuous: although the recorded ages have been truncated to whole numbers, the concept of age is continuous) |
 | education | 1 - 4 | No further information provided |
Behavioral | currentSmoker | 1: Yes, 0: No | Whether or not the patient is a current smoker (Nominal) |
 | cigsPerDay | 0 - 60 cigarettes | Average number of cigarettes the person smoked in one day (can be considered continuous, as one can smoke any number of cigarettes, even half a cigarette) |
Medical history | BPMeds | 1: Yes, 0: No | Whether or not the patient was on blood pressure medication (Nominal) |
 | prevalentStroke | 1: Yes, 0: No | Whether or not the patient had previously had a stroke (Nominal) |
 | prevalentHyp | 1: Yes, 0: No | Whether or not the patient was hypertensive (Nominal) |
 | diabetes | 1: Yes, 0: No | Whether or not the patient had diabetes (Nominal) |
Current medical condition | totChol | 100 - 512 mg/dL | Total cholesterol level (Continuous) |
 | sysBP | 70 - 295 mmHg | Systolic blood pressure (Continuous) |
 | diaBP | 45 - 145 mmHg | Diastolic blood pressure (Continuous) |
 | BMI | 15.5 - 56.8 kg/m^2 | Body Mass Index (Continuous) |
 | heartRate | 44 - 143 beats/min | Heart rate (in medical research, variables such as heart rate are in fact discrete but are treated as continuous because of the large number of possible values) |
 | glucose | 40 - 394 mg/dL | Glucose level (Continuous) |
Target variable to predict | TenYearCHD | 1: Yes, 0: No | 10-year risk of coronary heart disease (CHD) |
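The codes listed in Table 1 can be checked against the data. The sketch below shows one way to flag values outside the documented 0/1 coding; it runs on toy rows with a deliberately invalid value, and `invalid_codes` is a hypothetical helper rather than part of the original analysis.

```python
import pandas as pd

# Toy rows for illustration only.
df_demo = pd.DataFrame({
    "male": [1, 0, 1],
    "currentSmoker": [0, 1, 2],   # 2 is a deliberately invalid code
    "diabetes": [0, 0, None],     # missing values are ignored by the check
})
binary_vars = ["male", "currentSmoker", "diabetes"]


def invalid_codes(frame, columns, allowed=(0, 1)):
    """Return {column: sorted non-missing values that are not in `allowed`}."""
    problems = {}
    for col in columns:
        values = frame[col].dropna()
        bad = values[~values.isin(list(allowed))].unique().tolist()
        if bad:
            problems[col] = sorted(bad)
    return problems


print(invalid_codes(df_demo, binary_vars))
```

An empty dictionary means every checked column contains only the documented codes. The same idea extends to the continuous variables by comparing `min()` and `max()` with the ranges in the table.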
Step 4. Check data
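Note: Variable names were shortened to reduce the column widths.
df.columns.to_list()
# Short display names, in the same order as df.columns
keys = ['male', 'age', 'educ', 'Smoker', 'n_cigs', 'BPMeds', 'Stroke',
        'Hyp', 'dm', 'Chol', 'sysBP', 'diaBP', 'BMI', 'Rate', 'gluc', 'CHD']
# Pass the list itself: the string 'keys' would print the original column names
print(tabulate(df.head(10), headers=keys, showindex=False))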
Table 2. Dataset (1st 10 records)
male | age | educ | Smoker | cigs | Meds | Stro | Hyp | dm | Chol | SBP | DBP | BMI | Rate | gluc | CHD |
1 | 39 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 195 | 106 | 70 | 26.97 | 80 | 77 | 0 |
0 | 46 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 250 | 121 | 81 | 28.73 | 95 | 76 | 0 |
1 | 48 | 1 | 1 | 20 | 0 | 0 | 0 | 0 | 245 | 127.5 | 80 | 25.34 | 75 | 70 | 0 |
0 | 61 | 3 | 1 | 30 | 0 | 0 | 1 | 0 | 225 | 150 | 95 | 28.58 | 65 | 103 | 1 |
0 | 46 | 3 | 1 | 23 | 0 | 0 | 0 | 0 | 285 | 130 | 84 | 23.1 | 85 | 85 | 0 |
0 | 43 | 2 | 0 | 0 | 0 | 0 | 1 | 0 | 228 | 180 | 110 | 30.3 | 77 | 99 | 0 |
0 | 63 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 205 | 138 | 71 | 33.11 | 60 | 85 | 1 |
0 | 45 | 2 | 1 | 20 | 0 | 0 | 0 | 0 | 313 | 100 | 71 | 21.68 | 79 | 78 | 0 |
1 | 52 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 260 | 141.5 | 89 | 26.36 | 76 | 79 | 0 |
1 | 43 | 1 | 1 | 30 | 0 | 0 | 1 | 0 | 225 | 162 | 107 | 23.61 | 93 | 88 | 0 |
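As an alternative to supplying a separate header list at print time, the columns themselves can be renamed with a mapping. The sketch below is illustrative only: it pairs the original column names with the short names from the `keys` list above, and builds a one-row toy frame from the first record of Table 2. `short_names`, `df_demo`, and `df_short` are hypothetical names.

```python
import pandas as pd

# Original column name -> short display name (same order as the keys list).
short_names = {
    "male": "male", "age": "age", "education": "educ",
    "currentSmoker": "Smoker", "cigsPerDay": "n_cigs", "BPMeds": "BPMeds",
    "prevalentStroke": "Stroke", "prevalentHyp": "Hyp", "diabetes": "dm",
    "totChol": "Chol", "sysBP": "sysBP", "diaBP": "diaBP", "BMI": "BMI",
    "heartRate": "Rate", "glucose": "gluc", "TenYearCHD": "CHD",
}

# One-row toy frame using the first record shown in Table 2.
first_record = [1, 39, 4, 0, 0, 0, 0, 0, 0, 195, 106, 70, 26.97, 80, 77, 0]
df_demo = pd.DataFrame([first_record], columns=list(short_names))

# rename() returns a new frame; df_demo keeps its original column names.
df_short = df_demo.rename(columns=short_names)
print(df_short.columns.to_list())
```

Renaming changes the names used in all later code, whereas a header list only affects the printed table, so the choice depends on whether the long names are still needed downstream.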
# data shape: Check the number of records and variables
df.shape
Out[25]: (4240, 16)
# data types - This is important!
df.dtypes
Table 3. Data types
Variable | Type |
male | int64 |
age | int64 |
education | float64 |
currentSmoker | int64 |
cigsPerDay | float64 |
BPMeds | float64 |
prevalentStroke | int64 |
prevalentHyp | int64 |
diabetes | int64 |
totChol | float64 |
sysBP | float64 |
diaBP | float64 |
BMI | float64 |
heartRate | float64 |
glucose | float64 |
TenYearCHD | int64 |
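Several variables that hold whole-number codes or counts (education, cigsPerDay, BPMeds) are stored as float64. A likely reason is that these columns contain missing values (see Table 4), and the default integer dtype in Pandas cannot represent a missing value. The toy sketch below illustrates this behavior and one possible remedy, the nullable `Int64` dtype; the series names are hypothetical.

```python
import pandas as pd

# A complete integer column keeps an integer dtype.
s_complete = pd.Series([1, 0, 1])

# One missing value forces the whole column to float64 (NaN is a float).
s_missing = pd.Series([1, None, 0])

# The nullable integer dtype keeps whole numbers and marks the gap as <NA>.
s_nullable = s_missing.astype("Int64")

print(s_complete.dtype, s_missing.dtype, s_nullable.dtype)
```

Whether to convert depends on the later steps: some libraries expect plain NumPy dtypes, so the conversion is often postponed until missing values have been handled.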
# check for duplicates (rows identical to an earlier row)
duplicate_df = df[df.duplicated()]
len(duplicate_df)
Out[13]: 0
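The real dataset has no duplicate rows, so the check above returns 0. To show what the same calls report when duplicates do exist, here is a minimal sketch on a toy frame in which one row is repeated; all names and values are illustrative.

```python
import pandas as pd

# Toy frame: row 2 repeats row 0 exactly.
df_demo = pd.DataFrame({
    "age": [39, 46, 39, 50],
    "glucose": [77.0, None, 77.0, None],
})

# Count rows that repeat an earlier row.
n_dup = int(df_demo.duplicated().sum())

# keep=False marks every member of a duplicated group, for inspection.
dupes_all = df_demo[df_demo.duplicated(keep=False)]

# drop_duplicates() keeps the first occurrence of each row.
df_unique = df_demo.drop_duplicates()

print(n_dup, len(dupes_all), len(df_unique))
```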
# checking for missing values
missing = df.isna().sum().to_frame(name='Missing')
print(tabulate(missing, headers=['Variable', 'Missing'], tablefmt="html"))
null = df[df.isna().any(axis=1)]
null
Table 4. Frequency of missing values
Variable | Missing |
male | 0 |
age | 0 |
education | 105 |
currentSmoker | 0 |
cigsPerDay | 29 |
BPMeds | 53 |
prevalentStroke | 0 |
prevalentHyp | 0 |
diabetes | 0 |
totChol | 50 |
sysBP | 0 |
diaBP | 0 |
BMI | 19 |
heartRate | 1 |
glucose | 388 |
TenYearCHD | 0 |
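Raw counts are easier to judge as a share of the sample. The sketch below converts the non-zero counts from Table 4 into percentages of the 4240 records reported by `df.shape`; the variable names `n_records`, `missing_counts`, and `missing_pct` are hypothetical.

```python
import pandas as pd

n_records = 4240  # number of rows reported by df.shape

# Non-zero missing counts copied from Table 4.
missing_counts = pd.Series({
    "education": 105, "cigsPerDay": 29, "BPMeds": 53, "totChol": 50,
    "BMI": 19, "heartRate": 1, "glucose": 388,
})

# Percentage of records missing each variable, largest first.
missing_pct = (missing_counts / n_records * 100).round(2)
missing_pct = missing_pct.sort_values(ascending=False)
print(missing_pct)
```

On the full DataFrame, `df.isna().mean() * 100` computes the same percentages directly. Glucose stands out at about 9% missing, while every other variable is below 3%.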
Continue to Chapter 3. FHS Data Clean Up